{"cells": [{"cell_type": "markdown", "metadata": {}, "source": ["# TD7 - Text analysis\n", "\n", "Text analysis, TF-IDF, LDA, search engine, regular expressions."]}, {"cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", ""], "text/plain": ["\n", " | resolved_url | resolved_title | excerpt | tags |
---|---|---|---|---|
1883956314 | http://www.xavierdupre.fr/app/teachpyx/helpsph... | Types et variables du langage python\u00b6 | Il est impossible d\u2019\u00e9crire un programme sans u... | {'python': {'item_id': '1883956314', 'tag': 'p... |
1895830689 | https://www.pluralsight.com/paths/javascript?a... | JavaScript | ES6 is a major update to the JavaScript langua... | NaN |
1916603293 | http://www.seloger.com/annonces/locations/appa... | Location Appartement 56,88m\u00b2 Asnieres-sur-Sein... | Prix au m\u00b2 fourni \u00e0 titre indicatif, seul un p... | NaN |
1916600800 | http://www.seloger.com/annonces/locations/appa... | Location Appartement 82m\u00b2 Asnieres sur Seine -... | Trouvez votre bien \u00e0 tout moment ... | NaN |
1916598390 | http://www.seloger.com/annonces/locations/appa... | Location Appartement 93,6m\u00b2 Asni\u00e8res-sur-Seine | Trouvez votre bien \u00e0 tout moment ... | NaN | \n", "
\n", " | tags | url | excerpt | title | domain | html_soup |
---|---|---|---|---|---|---|
0 | ['mobile app'] | https://www.grafikart.fr/tutoriels/cordova/ion... | Ionic est un framework qui va vous permettre d... | Tutoriel Vid\u00e9o Apache CordovaIonic Framework | grafikart.fr | {'h2': ['Petit', 'tour', 'du', 'propri\u00e9taire']... |
1 | ['lewagon'] | http://www.colorhunt.co | Home Create Likes () About Add To Chrome Faceb... | Color Hunt | colorhunt.co | {} |
2 | ['data science'] | https://jakevdp.github.io/blog/2015/08/14/out-... | In recent months, a host of new tools and pack... | Out-of-Core Dataframes in Python: Dask and Ope... | jakevdp.github.io | {'h2': ['Pubs', 'of', 'the', 'British', 'Isles... |
3 | ['abtest'] | https://blog.dominodatalab.com/ab-testing-with... | In this post, I discuss a method for A/B testi... | A/B Testing with Hierarchical Models in Python | blog.dominodatalab.com | {'h2': ['Recent', 'Posts'], 'h3': ['Related'],... |
4 | ['mdn', 'documentation'] | https://developer.mozilla.org/en-US/docs/Learn... | Getting started with the Web is a concise seri... | Getting started with the Web | developer.mozilla.org | {'h2': ['Mozilla'], 'h3': ['How', 'the', 'web'... | \n", "
\n", " | tags | url | excerpt | title | domain | html_soup |
---|---|---|---|---|---|---|
0 | [mobile app] | https://www.grafikart.fr/tutoriels/cordova/ion... | Ionic est un framework qui va vous permettre d... | Tutoriel Vid\u00e9o Apache CordovaIonic Framework | grafikart.fr | {'h2': ['Petit', 'tour', 'du', 'propri\u00e9taire']... |
1 | [lewagon] | http://www.colorhunt.co | Home Create Likes () About Add To Chrome Faceb... | Color Hunt | colorhunt.co | {} |
2 | [data science] | https://jakevdp.github.io/blog/2015/08/14/out-... | In recent months, a host of new tools and pack... | Out-of-Core Dataframes in Python: Dask and Ope... | jakevdp.github.io | {'h2': ['Pubs', 'of', 'the', 'British', 'Isles... |
3 | [abtest] | https://blog.dominodatalab.com/ab-testing-with... | In this post, I discuss a method for A/B testi... | A/B Testing with Hierarchical Models in Python | blog.dominodatalab.com | {'h2': ['Recent', 'Posts'], 'h3': ['Related'],... |
4 | [mdn, documentation] | https://developer.mozilla.org/en-US/docs/Learn... | Getting started with the Web is a concise seri... | Getting started with the Web | developer.mozilla.org | {'h2': ['Mozilla'], 'h3': ['How', 'the', 'web'... | \n", "
\n", " | tags | url | excerpt | title | domain | html_soup | words_string |
---|---|---|---|---|---|---|---|
0 | [mobile app] | https://www.grafikart.fr/tutoriels/cordova/ion... | Ionic est un framework qui va vous permettre d... | Tutoriel Vid\u00e9o Apache CordovaIonic Framework | grafikart.fr | {'h2': ['Petit', 'tour', 'du', 'propri\u00e9taire']... | grafikart tutoriels cordova ionic framework tu... |
1 | [lewagon] | http://www.colorhunt.co | Home Create Likes () About Add To Chrome Faceb... | Color Hunt | colorhunt.co | {} | colorhunt color hunt home create likes about a... |
2 | [data science] | https://jakevdp.github.io/blog/2015/08/14/out-... | In recent months, a host of new tools and pack... | Out-of-Core Dataframes in Python: Dask and Ope... | jakevdp.github.io | {'h2': ['Pubs', 'of', 'the', 'British', 'Isles... | jakevdp core dataframes python outofcore dataf... |
3 | [abtest] | https://blog.dominodatalab.com/ab-testing-with... | In this post, I discuss a method for A/B testi... | A/B Testing with Hierarchical Models in Python | blog.dominodatalab.com | {'h2': ['Recent', 'Posts'], 'h3': ['Related'],... | dominodatalab ab testing hierarchical models p... |
4 | [mdn, documentation] | https://developer.mozilla.org/en-US/docs/Learn... | Getting started with the Web is a concise seri... | Getting started with the Web | developer.mozilla.org | {'h2': ['Mozilla'], 'h3': ['How', 'the', 'web'... | developer mozilla us docs learn gettingstarted... | \n", "