Machine learning (summary)

A useful summary of machine learning techniques and the sklearn (scikit-learn) tools that implement them. This summary was created by Udacity’s data scientists.

Dataset/Question

  • Do I have enough data?
  • Can I define a question?
  • Do I have enough or the right features to answer the question?

Features

Features are simply variables: pieces of information that can be quantified and recorded.

Feature creation

  • Think about it like a human

Feature exploration

  • Inspect for correlations
  • Outlier removal
  • Imputation (process of replacing missing data with substituted values)
  • Cleaning
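
As a rough illustration of two of the exploration steps above, the sketch below imputes a missing value and then checks the correlation between two features. The toy array is made up for this example, and mean imputation is only one of several strategies that SimpleImputer offers.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy data: two features, one missing value (np.nan).
X = np.array([[1.0, 10.0],
              [2.0, np.nan],
              [3.0, 30.0],
              [4.0, 40.0]])

# Imputation: replace each missing value with the mean of its column.
imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X)

# Inspect for correlations: Pearson correlation between the two features.
corr = np.corrcoef(X_filled[:, 0], X_filled[:, 1])[0, 1]
print(X_filled)
print("correlation:", round(corr, 3))
```

A strong correlation between two features can be a hint that one of them is redundant.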

Feature representation

  • Text vectorization
  • Discretization
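
A minimal sketch of both representations, assuming a bag-of-words count for the text and equal-width bins for the numeric feature. The sentences and ages are invented for illustration; CountVectorizer and KBinsDiscretizer are just one possible choice of sklearn tools.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import KBinsDiscretizer

# Text vectorization: turn each document into a vector of word counts.
docs = ["the cat sat", "the dog sat", "the cat ran"]  # hypothetical corpus
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)
vocab = sorted(vectorizer.vocabulary_)
print(vocab)
print(counts.toarray())

# Discretization: map a continuous feature into 3 equal-width bins.
ages = np.array([[5.0], [20.0], [35.0], [50.0], [65.0], [80.0]])  # hypothetical
discretizer = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
bins = discretizer.fit_transform(ages)
print(bins.ravel())
```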

Feature selection

  • K-best
  • Percentile
  • Recursive feature elimination
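
The three selection methods above could look roughly like this in sklearn. The synthetic dataset, the ANOVA F-score (f_classif), and the logistic regression used inside RFE are assumptions made for this sketch, not part of the original summary.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, SelectPercentile, f_classif
from sklearn.linear_model import LogisticRegression

# Synthetic data: 10 features, only 3 of which carry signal.
X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)

# K-best: keep the 3 features with the highest univariate score.
X_kbest = SelectKBest(score_func=f_classif, k=3).fit_transform(X, y)

# Percentile: keep the top 20% of features by the same score.
X_pct = SelectPercentile(score_func=f_classif, percentile=20).fit_transform(X, y)

# Recursive feature elimination: repeatedly fit a model and drop the
# weakest feature until 3 remain.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=3)
X_rfe = rfe.fit_transform(X, y)

print(X_kbest.shape, X_pct.shape, X_rfe.shape)
```

K-best and percentile score each feature on its own, while RFE judges features by how a fitted model uses them, so the three methods will not always agree.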

Feature scaling

  • Mean subtraction
  • Min-Max scaler
  • Standard scaler
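
To make the difference between the three scaling options concrete, here is a small sketch on an invented array whose columns have very different ranges.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical data: the second feature is 100x larger than the first.
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])

# Mean subtraction: center each column at zero, keep its spread.
X_centered = X - X.mean(axis=0)

# Min-max scaler: rescale each column to the range [0, 1].
X_minmax = MinMaxScaler().fit_transform(X)

# Standard scaler: zero mean and unit variance per column.
X_standard = StandardScaler().fit_transform(X)

print(X_centered)
print(X_minmax)
print(X_standard)
```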

Feature transform

  • Principal component analysis (PCA)
  • Independent component analysis (ICA)
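
A rough sketch of both transforms on synthetic data. The two source signals and the mixing matrix are invented here so that the three observed features really are combinations of two underlying components.

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA

rng = np.random.default_rng(0)

# Two hypothetical independent sources, mixed into three observed features.
sources = np.column_stack([np.sin(np.linspace(0, 8, 500)),
                           rng.uniform(-1, 1, 500)])
mixing = np.array([[1.0, 0.5],
                   [0.5, 1.0],
                   [1.0, 1.0]])
X = sources @ mixing.T

# PCA: project onto the 2 orthogonal directions of largest variance.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
print("explained variance ratio:", pca.explained_variance_ratio_)

# ICA: estimate 2 statistically independent components.
ica = FastICA(n_components=2, random_state=0)
X_ica = ica.fit_transform(X)
print(X_pca.shape, X_ica.shape)
```

PCA looks for directions of maximum variance, whereas ICA tries to unmix independent signals, so the recovered components generally differ.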

Algorithm selection

These are the algorithms used to analyze the features. Use supervised algorithms when the data is labeled; otherwise, use unsupervised algorithms.

Supervised (when there is labeled data)

Ordered or continuous output
  • Linear regression
  • Lasso regression
  • Decision tree regression
  • Support Vector Regression (SVR)
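
As a sketch of how the four regressors listed above share the same fit/score interface in sklearn, the example below fits each one to an invented noisy line. The hyperparameters (alpha, max_depth, the linear kernel) are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Hypothetical data: y = 3x + 2 plus a little Gaussian noise.
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 0.5, size=100)

models = {
    "linear": LinearRegression(),
    "lasso": Lasso(alpha=0.1),
    "tree": DecisionTreeRegressor(max_depth=3, random_state=0),
    "svr": SVR(kernel="linear"),
}

# score() returns r**2; it is computed on the training data here,
# so it is an optimistic estimate.
scores = {name: model.fit(X, y).score(X, y) for name, model in models.items()}
for name, score in scores.items():
    print(name, round(score, 3))
```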
Non-ordered or discrete output

Unsupervised (no labeled data)

Tuning algorithm

  • Adjust the parameters of the algorithm to improve results
  • Visual inspection
  • Performance on test data
  • GridSearchCV
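
GridSearchCV automates the parameter adjustment described above by trying every combination in a grid and scoring each one with cross-validation. The sketch below assumes an SVC classifier on the iris dataset; the grid values are arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Candidate parameter values (chosen arbitrarily for this sketch).
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

# Fit one model per combination, each evaluated with 5-fold cross-validation.
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_
print(best_params, round(best_score, 3))
```

The best score reported here comes from the cross-validation folds, so it is still worth checking the tuned model on held-out test data.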

Evaluation

Pick metric(s)

  • Sum of squared errors (SSE) / r²
  • Precision
  • Recall
  • F1 score
  • ROC curve
  • Custom
  • Bias/variance
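
Several of the metrics above can be computed directly with sklearn.metrics. The labels, predictions, and scores below are invented, and the area under the ROC curve is used as a single-number summary of the curve.

```python
import numpy as np
from sklearn.metrics import (f1_score, precision_score, r2_score,
                             recall_score, roc_auc_score)

# Hypothetical classification results.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]                 # hard class predictions
y_score = [0.9, 0.2, 0.4, 0.8, 0.6, 0.1]    # predicted probabilities

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two
auc = roc_auc_score(y_true, y_score)         # area under the ROC curve

# Hypothetical regression results.
y_reg_true = np.array([3.0, 5.0, 7.0])
y_reg_pred = np.array([2.5, 5.0, 7.5])

sse = float(np.sum((y_reg_true - y_reg_pred) ** 2))  # sum of squared errors
r2 = r2_score(y_reg_true, y_reg_pred)

print(precision, recall, f1, auc, sse, r2)
```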

Validate

  • Train/test split
  • K-fold
  • Visualize
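
A minimal sketch of the first two validation strategies, assuming a decision tree classifier on the iris dataset. The 30% test size and 5 folds are arbitrary choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Train/test split: hold out 30% of the data for evaluation only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
test_accuracy = clf.score(X_test, y_test)
print("held-out accuracy:", round(test_accuracy, 3))

# K-fold: every sample is used for testing exactly once across 5 folds.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=kfold)
print("fold accuracies:", fold_scores.round(3))
```

K-fold costs more computation than a single split but gives a less noisy estimate, which matters most when the dataset is small.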
