{"cells": [{"cell_type": "markdown", "metadata": {}, "source": ["# Review of Kaggle competitions (2016)\n", "\n", "Winners of [Kaggle](https://www.kaggle.com/) competitions sometimes describe their solutions on the Kaggle blog, [No Free Hunch](http://blog.kaggle.com/). There are always good ideas to pick up."]}, {"cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", ""], "text/plain": [""]}, "execution_count": 2, "metadata": {}, "output_type": "execute_result"}], "source": ["%matplotlib inline\n", "from jyquickhelper import add_notebook_menu\n", "add_notebook_menu()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["The [Kaggle blog](http://blog.kaggle.com/) regularly publishes interviews with competition winners. They are a chance to discover the best solution and the tools used to build it. Some competitions are academic, and the winners sometimes release their code on GitHub."]}, {"cell_type": "markdown", "metadata": {"collapsed": true}, "source": ["## The Allen AI Science Challenge\n", "\n", "[kaggle](https://www.kaggle.com/c/the-allen-ai-science-challenge)\n", "\n", "* **Goal:** predict the correct answer to a multiple-choice question\n", "* **Data:** multiple-choice questions and their answers\n", "\n", "[The Allen AI Science Challenge, Winner's Interview: 3rd place, Alejandro Mosquera](http://blog.kaggle.com/2016/04/09/the-allen-ai-science-challenge-winners-interview-3rd-place-alejandro-mosquera/) (see also the [summary of the three solutions](https://gist.github.com/vihari/32b11ad1fac001cfab5981430ad8f36c))"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Types of features\n", "\n", "* ES_raw_lemma: IR scores by using ES and raw/lemmatized KB.\n", "* ES_lemma_regex: Regex scoring (number of characters matched) after IR results.\n", "* W2V_COSQA: Cosine similarity between question and answer embeddings.\n", "* CAT_A1: Is multiphrase question + short response?\n", "* CAT_A2: Is fill the _______ + no direct question + long response?\n", "* CAT_A3: Is multiphrase question + long response?\n", "* ANS_all_above: Is \"all of the above\" answer?\n", "* ANS_none_above: Is \"none of the above\" answer?\n", "* ANS_both: Is \"both X and Y\" answer?\n", "* ES_raw_lemmaLMJM: IR scores by using ES and raw/lemmatized 
KB with LMJM scoring.\n", "* ES_lemma_regexLMJM: Regex scoring (number of characters matched) after IR results using LMJM.\n", "\n", "ES = Elasticsearch\n", "\n", "On GitHub: [amsqr/Allen_AI_Kaggle](https://github.com/amsqr/Allen_AI_Kaggle)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Ideas worth reusing\n", "\n", "* adding data from sources outside the competition (see the list of [resources added by the winner](https://github.com/Cardal/Kaggle_AllenAIscience/blob/master/README.txt#L23))\n", "* computing statistics on corpora larger than the competition data\n", "* code available on GitHub: examples worth reusing\n", "* use of [BM25](https://en.wikipedia.org/wiki/Okapi_BM25), a ranking function that refines [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) with term-frequency saturation and document-length normalization"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Predicting Red Hat Business Value\n", "\n", "[kaggle](https://www.kaggle.com/c/predicting-red-hat-business-value)\n", "\n", "* **Goal:** estimate a customer's potential, defined by whether an event occurs within a time window\n", "* **Data:** information about users and their actions\n", "\n", "[Red Hat Business Value Competition, 1st Place Winner's Interview: Darius Baru\u0161auskas](http://blog.kaggle.com/2016/11/03/red-hat-business-value-competition-1st-place-winners-interview-darius-barusauskas/)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Key points\n", "\n", "* **leakage** (1): part of the dataset was affected by a *data leak*; the winner chose to use a different strategy for the part with the leak and for the part without it\n", "* **leakage** (2): the winner used this leak to build a model that could also serve for the data that did not benefit from the leak\n", "* **aggregated features:** statistics had to be aggregated by the company of the Red Hat customer (several customers can belong to the same company)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## TalkingData Mobile User Demographics\n", "\n", "[kaggle](https://www.kaggle.com/c/talkingdata-mobile-user-demographics)\n", "\n", "* **Goal:** describe a person (gender, age) from how they use their phone\n", "* **Data:** description of the phones and of events relating to several people\n", "\n", "[TalkingData Mobile User Demographics Competition, Winners' Interview: 3rd Place, Team utc(+1,-3) | Danijel & Matias](http://blog.kaggle.com/2016/10/19/talkingdata-mobile-user-demographics-competition-winners-interview-3rd-place-team-utc1-3-danijel-matias/)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Interesting ideas\n", "\n", "* TF-IDF: applied in an atypical setting (phone brands and models)\n", "* use of [keras](https://www.kaggle.com/chechir/talkingdata-mobile-user-demographics/keras-on-labels-and-brands)\n", "* use of [sparse matrices](https://en.wikipedia.org/wiki/Sparse_matrix)\n", "* xgboost and neural networks reached their best performance with different features\n", "* problem modelling: first predict the gender, then use that result as a feature to predict the age group: $P(A_i,F)=P(A_i|F)P(F)$\n"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Grupo Bimbo Inventory Demand\n", "\n", "[kaggle](https://www.kaggle.com/c/grupo-bimbo-inventory-demand)\n", "\n", "* **Goal:** predict demand (to limit inventory and overproduction)\n", "* **Data:** past sales\n", "\n", "[Grupo Bimbo 
Inventory Demand, Winners' Interview: Clustifier & Alex & Andrey](http://blog.kaggle.com/2016/09/27/grupo-bimbo-inventory-demand-winners-interviewclustifier-alex-andrey/)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Notable facts\n", "\n", "* Truncated SVD on TF-IDF matrix of client and product names\n", "* particular care given to the features\n", "* use of [FTRL](http://www.eecs.tufts.edu/~dsculley/papers/ad-click-prediction.pdf) and FFM models ([script](https://www.kaggle.com/jiweiliu/springleaf-marketing-response/ftrl-starter-code)): FTRL comes from online advertising, where the probability of a click has to be predicted. It is an [online learning](https://en.wikipedia.org/wiki/Online_machine_learning) algorithm that updates the model as new data arrives. This assumes the data is sequential in time.\n", "* the winner took the course taught by [Alexander D'yakonov](http://alexanderdyakonov.narod.ru/engpapers.htm), author of [Two Recommendation Algorithms Based on Deformed Linear Combinations](http://ceur-ws.org/Vol-770/paper5.pdf)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Facebook V: Predicting Check Ins\n", "\n", "[kaggle](https://www.kaggle.com/c/facebook-v-predicting-check-ins)\n", "\n", "* **Goal:** given (x, y, location accuracy, timestamp), predict a business id\n", " \n", "Three solutions:\n", " \n", "* [Facebook V: Predicting Check Ins, Winner's Interview: 1st Place, Tom Van de Wiele](http://blog.kaggle.com/2016/08/16/facebook-v-predicting-check-ins-winners-interview-1st-place-tom-van-de-wiele/)\n", "* [Facebook V: Predicting Check Ins, Winner's Interview: 2nd Place, Markus Kliegl](http://blog.kaggle.com/2016/08/02/facebook-v-predicting-check-ins-winners-interview-2nd-place-markus-kliegl/)\n", "* [Facebook V: 
Predicting Check Ins, Winner's Interview: 3rd Place, Ryuji Sakata](http://blog.kaggle.com/2016/08/18/facebook-v-predicting-check-ins-winners-interview-3rd-place-ryuji-sakata/)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Obstacles and solutions\n", "\n", "* multi-class: a plain multi-class model is not usable, there are too many classes and performance is poor\n", "* approach 1: recast the problem as a ranking problem (search-engine style), where heuristics produce 20 candidates that a model then scores\n", "* approach 2: Bayes' rule, with *place* denoting the business id: $P(\\text{place} \\mid x, y, \\text{accuracy}, \\text{time}) \\propto P(x, y, \\text{accuracy}, \\text{time} \\mid \\text{place}) \\, P(\\text{place})$"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Avito Duplicate Ads Detection\n", "\n", "[kaggle](https://www.kaggle.com/c/avito-duplicate-ads-detection)\n", "\n", "* **Goal:** find duplicate ads in a database of ads\n", "* **Data:** images and text of the ads\n", "\n", "* [Avito Duplicate Ads Detection, Winners' Interview: 1st Place Team, Devil Team | Stanislav Semenov & Dmitrii Tsybulevskii](http://blog.kaggle.com/2016/08/24/avito-duplicate-ads-detection-winners-interview-1st-place-team-devil-team-stanislav-dmitrii/)\n", "* [Avito Duplicate Ads Detection, Winners' Interview: 2nd Place, Team TheQuants | Mikel, Peter, Marios, & Sonny](http://blog.kaggle.com/2016/08/31/avito-duplicate-ads-detection-winners-interview-2nd-place-team-the-quants-mikel-peter-marios-sonny/)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Interesting points\n", "\n", "* features computed from a wide variety of sources (image, text, title, description, brand, price, location) --> [long list of features](http://blog.kaggle.com/2016/08/31/avito-duplicate-ads-detection-winners-interview-2nd-place-team-the-quants-mikel-peter-marios-sonny/), [weights of 
evidence](http://www.listendata.com/2015/03/weight-of-evidence-woe-and-information.html)\n", "* data preparation issue: labels were given as pairs (ad i = ad j) --> how to split them into training and test sets so as to avoid overfitting\n", "* the winner used a pretrained deep learning model, the [Full ImageNet Network](https://github.com/dmlc/mxnet-model-gallery/blob/master/imagenet-21k-inception.md), taking the output of an intermediate layer rather than the final output"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Draper Satellite Image Chronology\n", "\n", "[kaggle](https://www.kaggle.com/c/draper-satellite-image-chronology)\n", "\n", "* **Goal:** order images of the same location in time\n", "* **Data:** ordered images\n", "\n", "* [Draper Satellite Image Chronology: Pure ML Solution | Vicens Gaitan](http://blog.kaggle.com/2016/09/15/draper-satellite-image-chronology-machine-learning-solution-vicens-gaitan/)\n", "* [Draper Satellite Image Chronology: Pure ML Solution | Damien Soukhavong](http://blog.kaggle.com/2016/09/08/draper-satellite-image-chronology-damien-soukhavong/)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Interesting points\n", "\n", "* the [notebook](https://www.kaggle.com/vicensgaitan/draper-satellite-image-chronology/image-registration-the-r-way/notebook) by the author of the first solution explains image matching very clearly\n", "* [Fast Explicit Diffusion for Accelerated Features in Nonlinear Scale Spaces](http://isit.u-clermont1.fr/~ab/Publications/Alcantarilla_etal_BMVC13.pdf), [AKAZE](http://docs.opencv.org/3.0-beta/doc/tutorials/features2d/akaze_matching/akaze_matching.html#akaze)\n", "* [RANSAC](https://en.wikipedia.org/wiki/Random_sample_consensus)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Yelp Restaurant Photo Classification\n", "\n", "[kaggle](https://www.kaggle.com/c/yelp-restaurant-photo-classification)\n", "\n", "* **Goal:** classify restaurant photos; the twist is that a photo can have several labels\n", "* **Data:** images as input, labels to predict as output: 0: good_for_lunch, 1: good_for_dinner, 2: takes_reservations, 3: outdoor_seating, 4: restaurant_is_expensive, 5: has_alcohol, 6: has_table_service, 7: ambience_is_classy, 8: good_for_kids\n", "\n", "[Yelp Restaurant Photo Classification, Winner's Interview: 1st Place, Dmitrii Tsybulevskii](http://blog.kaggle.com/2016/04/28/yelp-restaurant-photo-classification-winners-interview-1st-place-dmitrii-tsybulevskii/)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Interesting points\n", "\n", "* several ideas for handling the multi-label case\n", "* [Fisher Vectors](http://www.vlfeat.org/api/fisher-fundamentals.html)\n", "* [TruncatedSVD](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html)\n", "* [Multiple Instance Classification: review, taxonomy and comparative study](http://158.109.8.37/files/Amo2013.pdf)\n", "* [Classifier Chains for Multi-label Classification](http://www.cs.waikato.ac.nz/ml/publications/2009/chains.pdf)\n", "* [Random k-Labelsets for Multi-Label Classification](http://lpis.csd.auth.gr/publications/tsoumakas-tkde10.pdf): a method that performed less well on this problem"]}, {"cell_type": "code", "execution_count": 2, "metadata": {"collapsed": true}, "outputs": [], "source": []}], "metadata": {"kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.1"}}, "nbformat": 4, 
"nbformat_minor": 2}