A useful summary of machine learning techniques using scikit-learn (sklearn). This summary was created by Udacity’s data scientists.
Dataset/Question
- Do I have enough data?
- Can I define a question?
- Do I have enough or the right features to answer the question?
Features
Features are simply variables: pieces of information that can be quantified and recorded.
Feature creation
- Think about it like a human
Feature exploration
- Inspect for correlations
- Outlier removal
- Imputation (process of replacing missing data with substituted values)
- Cleaning
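A minimal sketch of the imputation step above using sklearn's SimpleImputer, assuming a tiny toy array with a missing value:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy feature matrix with a missing entry (np.nan), assumed for illustration
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, 6.0]])

# Replace missing entries with the column mean
imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X)
print(X_imputed)  # the nan becomes (1.0 + 7.0) / 2 = 4.0
```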
Feature representation
- Text vectorization
- Discretization
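A minimal sketch of text vectorization (TfidfVectorizer) and discretization (KBinsDiscretizer), assuming a couple of toy documents and a toy numeric column:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import KBinsDiscretizer

# Text vectorization: turn raw documents into a sparse tf-idf matrix
docs = ["machine learning with sklearn", "feature engineering with sklearn"]
vectorizer = TfidfVectorizer()
X_text = vectorizer.fit_transform(docs)
print(X_text.shape)  # (2 documents, vocabulary size)

# Discretization: bin a continuous feature into 3 ordinal buckets
ages = np.array([[18.0], [25.0], [47.0], [63.0]])  # toy values, assumed
discretizer = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
print(discretizer.fit_transform(ages))
```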
Feature selection
- K-best
- Percentile
- Recursive feature elimination
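A minimal sketch of K-best selection on the iris dataset (chosen here only for illustration); percentile-based selection and recursive feature elimination follow the same fit/transform pattern:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Keep the 2 features with the highest ANOVA F-score against the labels
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)
print(X.shape, "->", X_selected.shape)  # (150, 4) -> (150, 2)
```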
Feature scaling
- Mean subtraction
- Min-Max scaler
- Standard scaler
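A minimal sketch of the two scalers above on a toy column (values assumed for illustration); Min-Max rescales to [0, 1], while the standard scaler subtracts the mean and divides by the standard deviation:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[115.0], [140.0], [175.0]])  # toy feature column

# Min-Max: (x - min) / (max - min), so values land in [0, 1]
print(MinMaxScaler().fit_transform(X))

# Standard: (x - mean) / std, so the column has mean 0 and unit variance
print(StandardScaler().fit_transform(X))
```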
Feature transform
- Principal component analysis (PCA)
- Independent component analysis (ICA)
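A minimal sketch of PCA on the iris data (ICA works analogously via sklearn's FastICA):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)

# Project the 4 original features onto the 2 directions of maximum variance
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
print(X_pca.shape)                    # (150, 2)
print(pca.explained_variance_ratio_)  # variance captured by each component
```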
Algorithm selection
The algorithm is what analyzes the features. Use supervised algorithms when you have labeled data; otherwise, use unsupervised algorithms.
Supervised (when there is labeled data)
Ordered or continuous output (regression)
- Linear regression
- Lasso regression
- Decision tree regression
- Support Vector Regression (SVR)
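A minimal sketch comparing a few of the regressors above on toy data (assumed for illustration); they all share the same fit/predict interface:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR

# Toy continuous target: y is roughly 2 * x plus a little noise
X = np.arange(10).reshape(-1, 1).astype(float)
y = 2.0 * X.ravel() + np.random.default_rng(0).normal(0, 0.1, 10)

for model in (LinearRegression(), Lasso(alpha=0.1), DecisionTreeRegressor(), SVR()):
    model.fit(X, y)
    print(type(model).__name__, model.predict([[10.0]]))
```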
Non-ordered or discrete output (classification)
- Naive Bayes
- Decision tree
- Support vector machines (SVM)
- Ensemble methods
- K nearest neighbors
- Logistic regression
- Linear Discriminant Analysis (LDA)
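A minimal sketch of one classifier from the list above on the iris dataset (Gaussian Naive Bayes here; the other classifiers share the same fit/predict interface):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit Naive Bayes on the training split and score it on held-out data
clf = GaussianNB()
clf.fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))
```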
Unsupervised (no labeled data)
- K-means clustering
- Spectral clustering
- Principal component analysis (PCA)
- Mixture models/EM algorithm
- Outlier detection
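A minimal sketch of K-means clustering in the unsupervised setting (the iris labels are loaded but deliberately ignored, as an assumption for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

X, _ = load_iris(return_X_y=True)  # ignore the labels: unsupervised setting

# Group the samples into 3 clusters based only on the features
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)
print(cluster_ids[:10])         # cluster assignment per sample
print(kmeans.cluster_centers_)  # one centroid per cluster
```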
Tuning algorithm
- Adjust the parameters of the algorithm to improve results
- Visual inspection
- Performance on test data
- GridSearchCV
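A minimal sketch of GridSearchCV tuning an SVM's kernel and C with cross-validation (the parameter grid is an assumption for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Try every combination in the grid with 5-fold cross-validation
param_grid = {"kernel": ["linear", "rbf"], "C": [1, 10, 100]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)  # best parameter combination found
print(search.best_score_)   # its mean cross-validated accuracy
```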
Evaluation
Pick metric(s)
- SSE / R²
- Precision
- Recall
- F1 score
- ROC curve
- Custom
- Bias/variance
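A minimal sketch of a few of the classification metrics above on hand-made predictions (the arrays are assumptions for illustration; SSE and R² apply to regression via mean_squared_error and r2_score):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1]  # toy ground-truth labels
y_pred = [1, 0, 0, 1, 1, 1]  # toy predictions

print(precision_score(y_true, y_pred))  # of predicted positives, how many are right
print(recall_score(y_true, y_pred))     # of actual positives, how many were found
print(f1_score(y_true, y_pred))         # harmonic mean of precision and recall
```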
Validate
- Train/test split
- K-fold
- Visualize
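A minimal sketch of the two validation strategies above: a single train/test split and K-fold cross-validation (SVC is just an assumed example estimator):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Single hold-out split: train on 70% of the data, evaluate on the remaining 30%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print(SVC().fit(X_train, y_train).score(X_test, y_test))

# K-fold cross-validation: average the accuracy over 5 folds
print(cross_val_score(SVC(), X, y, cv=5).mean())
```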