Introduction to Spark
Articles
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica
Deep Dive into Catalyst, Catalyst — Tree Manipulation Framework
What is Tungsten for Apache Spark?, Project Tungsten: Bringing Apache Spark Closer to Bare Metal
Loosely related articles
Stochastic Gradient Descent Tricks, Léon Bottou
A Fast Distributed Stochastic Gradient Descent Algorithm for Matrix Factorization, Fanglin Li, Bin Wu, Liutong Xu, Chuan Shi, Jing Shi
Parallelized Stochastic Gradient Descent, Martin A. Zinkevich, Markus Weimer, Alex Smola, Lihong Li
Topic Similarity Networks: Visual Analytics for Large Document Sets, Arun S. Maiya, Robert M. Rolfe
Low-dimensional Embeddings for Interpretable Anchor-based Topic Inference, Moontae Lee, David Mimno
K-means on Azure, Matthieu Durut, Fabrice Rossi
Confidence intervals for AB-test, Cyrille Dubarry
FAQ
Modules
spark-sklearn : a distributed grid search implementation for scikit-learn.
turicreate : a mix of deep learning and Spark
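A grid search distributes well because each parameter combination can be evaluated independently of the others. The sketch below illustrates that idea in plain Python with a thread pool. It does not use the spark-sklearn API: `score`, `grid_search`, and the toy scoring formula are hypothetical stand-ins for fitting a model and measuring its validation score.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def score(params):
    # Hypothetical stand-in for "fit a model, return its validation score".
    # This toy score is highest at alpha=0.1, depth=3.
    return -((params["alpha"] - 0.1) ** 2) - (params["depth"] - 3) ** 2


def grid_search(grid, max_workers=4):
    # Enumerate every combination of the parameter grid.
    names = sorted(grid)
    candidates = [
        dict(zip(names, values))
        for values in product(*(grid[name] for name in names))
    ]
    # Candidates are independent, so they can be scored in parallel.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scores = list(pool.map(score, candidates))
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores[best]


grid = {"alpha": [0.01, 0.1, 1.0], "depth": [2, 3, 4]}
best_params, best_score = grid_search(grid)
print(best_params, best_score)
```

On a cluster, the thread pool would be replaced by workers on different machines, each receiving a share of the candidates.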
Other libraries / tools
Hadoop : distributed file system + simple MapReduce
Kafka : distributed streaming platform, designed to store and retrieve website events in real time
Mesos : Apache Mesos abstracts CPU, memory, storage, and other compute resources away from machines (physical or virtual), Elixi
MLlib : distributed machine learning for Spark
Parquet : Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem.
Presto : Distributed SQL Query Engine for Big Data (Facebook)
Spark : MapReduce that minimizes disk access (DPark is a Python clone of Spark)
Spark SQL : distributed SQL, built on top of Spark
Storm : Apache Storm is a free and open source distributed realtime computation system, designed to distribute data-processing pipelines
YARN : resource negotiator (Yet Another Resource Negotiator)
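Several entries above (Hadoop, Spark) rely on the MapReduce model. As a rough illustration, here is a minimal single-machine sketch of its three steps (map, shuffle, reduce) applied to a word count. It uses no Hadoop or Spark API, and the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical.

```python
from collections import defaultdict
from functools import reduce


def map_phase(document):
    # Map: emit one (word, 1) pair per word of the document.
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    # Shuffle: group all values that share the same key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    # Reduce: combine the values of each key into a single result.
    return {
        key: reduce(lambda a, b: a + b, values)
        for key, values in groups.items()
    }


def word_count(documents):
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return reduce_phase(shuffle(pairs))


counts = word_count([
    "spark minimizes disk access",
    "hadoop writes to disk",
    "Spark SQL runs on Spark",
])
print(counts["spark"], counts["disk"])
```

In a real cluster the map and reduce steps run on different machines and the shuffle moves data over the network. Whether intermediate results stay in memory or are written to disk is the point on which the Spark entry above contrasts with plain MapReduce.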