=====================
Introduction to Spark
=====================

.. toctree::
    :maxdepth: 1

    spark_install
    ../notebooks/spark_first_steps
    ../notebooks/spark_matrix_3_columns
    ../notebooks/spark_mllib

*Articles*

* `Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing `_,
  Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma,
  Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica
* `From scikit-learn to Spark ML `_
* `Deep Dive into Catalyst `_,
  `Catalyst — Tree Manipulation Framework `_
* `What is Tungsten for Apache Spark? `_,
  `Project Tungsten: Bringing Apache Spark Closer to Bare Metal `_
* `Spark SQL: Another 16x times faster after Tungsten `_
* `Databricks `_

*More loosely related articles*

* `Stochastic Gradient Descent Tricks `_, Léon Bottou
* `A Fast Distributed Stochastic Gradient Descent Algorithm for Matrix Factorization `_,
  Fanglin Li, Bin Wu, Liutong Xu, Chuan Shi, Jing Shi
* `Parallelized Stochastic Gradient Descent `_,
  Martin A. Zinkevich, Markus Weimer, Alex Smola, Lihong Li
* `Topic Similarity Networks: Visual Analytics for Large Document Sets `_,
  Arun S. Maiya, Robert M. Rolfe
* `Low-dimensional Embeddings for Interpretable Anchor-based Topic Inference `_,
  Moontae Lee, David Mimno
* `K-means on Azure `_, Matthieu Durut, Fabrice Rossi
* `Confidence intervals for AB-test `_, Cyrille Dubarry
* `Tutorial: Spark-GPU Cluster Dev in a Notebook `_

*FAQ*

* `Avoid GroupByKey `_ (see the first sketch at the end of this page)
* `What is the difference between cache and persist? `_ (see the second sketch)

*Modules*

* `spark-sklearn `_: a distributed grid search implementation for
  `scikit-learn `_.
* `turicreate `_: a mix of deep learning and *spark*

*Other libraries / tools*

* `Hadoop `_: distributed file system + simple Map Reduce
* `Kafka `_: distributed streaming platform, designed to store and
  retrieve website events in real time
* `Mesos `_: Apache Mesos abstracts CPU, memory, storage, and other
  compute resources away from machines (physical or virtual), `Elixi `_
* `MLlib `_: distributed machine learning for Spark
* `Parquet `_: Apache Parquet is a columnar storage format available
  to any project in the Hadoop ecosystem.
* `Presto `_: Distributed SQL Query Engine for Big Data (Facebook)
* `Spark `_: Map Reduce which minimizes disk access
  (`DPark `_ is a Python clone of Spark)
* `Spark SQL `_: distributed SQL, built on top of Spark
* `Storm `_: Apache Storm is a free and open source distributed
  realtime computation system, designed to distribute data processing pipelines
* `YARN `_: resource negotiator (Yet Another Resource Negotiator)
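
The first FAQ entry above comes down to preferring ``reduceByKey`` over
``groupByKey``: ``reduceByKey`` combines values locally on each partition
before the shuffle, so far less data crosses the network. A minimal sketch,
assuming an existing ``SparkContext`` named ``sc`` (that variable name is an
assumption, not part of the linked article):

.. code-block:: python

    # Word count on a toy RDD: both versions return the same counts.
    rdd = sc.parallelize(["a", "b", "a", "c", "b", "a"]).map(lambda w: (w, 1))

    # To avoid: groupByKey ships every (key, value) pair across the network
    # before anything is aggregated.
    counts_slow = rdd.groupByKey().mapValues(lambda values: sum(values))

    # Preferred: reduceByKey pre-aggregates on each partition
    # (map-side combine), then shuffles only the partial sums.
    counts_fast = rdd.reduceByKey(lambda a, b: a + b)

    print(sorted(counts_fast.collect()))  # [('a', 3), ('b', 2), ('c', 1)]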
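
For the second FAQ entry: on an RDD, ``cache()`` is simply ``persist()`` with
the default storage level ``MEMORY_ONLY``, while ``persist()`` accepts an
explicit storage level. A short sketch, under the same assumption of an
existing ``SparkContext`` ``sc``:

.. code-block:: python

    from pyspark import StorageLevel

    rdd = sc.parallelize(range(10**6))

    # cache() is equivalent to persist(StorageLevel.MEMORY_ONLY) for an RDD.
    rdd.cache()

    # A storage level cannot be changed once assigned, so release it first,
    # then persist with a level that spills to disk when memory is short.
    rdd.unpersist()
    rdd.persist(StorageLevel.MEMORY_AND_DISK)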