Homemade Machine Learning: ML algorithms from scratch in Python
Python implementations of popular machine learning algorithms from scratch with interactive Jupyter notebooks and mathematical explanations. Learn the fundamentals by building ML algorithms yourself.
- Step 1
What is Homemade Machine Learning?
Homemade Machine Learning is an educational repository by Oleksii Trekhleb (@trekhleb) that implements popular machine learning algorithms from scratch in Python. Unlike typical ML tutorials that rely on library one-liners, this project focuses on understanding the mathematics and fundamentals behind each algorithm.
The repository contains:
- Python implementations of ML algorithms built on NumPy (no high-level ML libraries)
- Interactive Jupyter Notebook demos for hands-on experimentation
- Mathematical explanations and theory for each algorithm
- Real-world datasets and visualization examples
- Support for both supervised and unsupervised learning
Key Learning Value:
- Understand the math behind ML algorithms
- See how algorithms work step-by-step
- Experiment with training data and hyperparameters in real-time
- Build intuition before using production ML libraries
Note: These implementations are intentionally educational and not optimized for production use. For production ML, use established libraries like scikit-learn, TensorFlow, or PyTorch.
- Step 2
Repository architecture
The repository follows a clean structure that separates algorithm implementations, interactive demos, and supporting data:
Core Structure:
homemade-machine-learning/
├── homemade/                  # Algorithm implementations
│   ├── linear_regression/
│   ├── logistic_regression/
│   ├── k_means/
│   ├── neural_network/
│   ├── anomaly_detection/
│   └── utils/                 # Shared utilities (features, hypothesis, etc.)
├── notebooks/                 # Interactive Jupyter demos
│   ├── linear_regression/
│   ├── logistic_regression/
│   ├── k_means/
│   ├── neural_network/
│   └── anomaly_detection/
├── data/                      # Training datasets (CSV files)
└── images/                    # Documentation assets
Organization Pattern: Each algorithm follows the same pattern:
- Implementation in homemade/<algorithm>/ with math documentation
- One or more demo notebooks in notebooks/<algorithm>/
- Datasets in data/ referenced by notebooks
This structure makes it easy to:
- Navigate between theory (code) and practice (notebooks)
- Run demos independently
- Compare different algorithm approaches
- Step 3
Technology stack
The project uses a minimal, focused tech stack centered on scientific Python libraries for numerical computing and visualization.
Core Language:
- Python 3.6+ (originally 3.6, compatible with newer versions)
Scientific Computing:
- NumPy 1.15.3 — Core numerical computing library for matrix operations, linear algebra, and vectorized calculations. The foundation of all algorithm implementations.
- Pandas 0.23.4 — Data manipulation and CSV reading for dataset loading
- SciPy 1.1.0 — Scientific computing utilities (optimization, statistics)
Visualization:
- Matplotlib 3.0.1 — Primary plotting library for 2D charts, scatter plots, decision boundaries
- Plotly 3.4.1 — Interactive 3D visualizations and advanced plots
Development Tools:
- Jupyter 1.0.0 — Interactive notebook environment for running demos
- Pylint 2.1.1: Python linter for code quality (configured via pylintrc)
CI/CD:
- Travis CI: Automated linting on commits (configured via .travis.yml)
Why This Stack? The deliberately minimal dependencies keep the focus on algorithmic fundamentals rather than framework abstractions. NumPy provides the mathematical primitives (matrix multiplication, vectorized arithmetic, etc.), the gradients are written out by hand in each implementation, and Jupyter enables interactive experimentation.
Tech Stack:
├── Python 3.6+
├── NumPy 1.15.3 (matrix ops, linear algebra)
├── Pandas 0.23.4 (data loading)
├── Matplotlib 3.0.1 (2D plotting)
├── Plotly 3.4.1 (3D visualization)
├── Jupyter 1.0.0 (notebooks)
└── Pylint 2.1.1 (code quality)
- Step 4
Installation and setup
Getting started with Homemade Machine Learning requires Python 3.6+ and installing the scientific computing dependencies.
Clone the Repository:

    git clone https://github.com/trekhleb/homemade-machine-learning.git
    cd homemade-machine-learning

Create a Virtual Environment (Recommended):

    # Using venv (Python 3.6+)
    python3 -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate

Install Dependencies:

    pip install -r requirements.txt

This installs:
- jupyter (notebook environment)
- matplotlib (visualization)
- numpy (numerical computing)
- pandas (data manipulation)
- plotly (interactive plots)
- scipy (scientific utilities)
- pylint (linting)
Verify Installation:

    python -c "import numpy, pandas, matplotlib, jupyter; print('All dependencies installed!')"

The full setup sequence in one place:

    # Clone repository
    git clone https://github.com/trekhleb/homemade-machine-learning.git
    cd homemade-machine-learning

    # Create virtual environment
    python3 -m venv venv
    source venv/bin/activate

    # Install dependencies
    pip install -r requirements.txt

    # Verify installation
    python -c "import numpy, pandas, matplotlib, jupyter; print('Ready!')"

- Step 5
Launching Jupyter notebooks
The repository includes 11 interactive Jupyter notebooks that demonstrate each algorithm with real datasets.
Launch Jupyter Locally:

    # From the repository root
    jupyter notebook

This starts the Jupyter server and opens your browser at http://localhost:8888. Navigate to the notebooks/ folder to access the demos.
Online Options (No Installation Required):
- NBViewer (Read-Only Preview):
  - Fast online preview of notebooks
  - View code, charts, and results
  - Cannot modify or run code
  - All demo links in the README point to NBViewer
- Binder (Interactive):
  - Full interactive notebook environment in your browser
  - Can modify code and re-run cells
  - Click "Execute on Binder" button in any NBViewer page
  - Takes ~2 minutes to build the environment
Notebook Organization: Notebooks are grouped by algorithm type:
- linear_regression/: 3 demos (univariate, multivariate, non-linear)
- logistic_regression/: 4 demos (linear boundary, non-linear, MNIST, Fashion MNIST)
- k_means/: 1 demo (Iris clustering)
- neural_network/: 2 demos (MNIST, Fashion MNIST)
- anomaly_detection/: 1 demo (Gaussian distribution)

    # Local execution
    jupyter notebook
    # → Opens http://localhost:8888
    # → Navigate to notebooks/ folder

    # Alternative: Launch a specific notebook
    jupyter notebook notebooks/linear_regression/univariate_linear_regression_demo.ipynb

- Step 6
Algorithm implementations overview
The repository implements algorithms across three main categories:
Supervised Learning
Regression (predicting continuous values):
- Linear Regression: draw a line/plane through data points
  - Univariate: Single feature prediction
  - Multivariate: Multiple features
  - Non-linear: Polynomial and sinusoid features
  - Use cases: Stock prices, sales forecasting, trend analysis
Classification (categorizing data into classes):
- Logistic Regression: binary and multi-class classification
  - Linear boundaries for simple separation
  - Non-linear boundaries using feature engineering
  - Multivariate for high-dimensional data (MNIST digits, Fashion MNIST)
  - Use cases: Spam detection, language detection, image recognition
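To make the classification idea concrete, here is a minimal NumPy sketch of binary logistic regression trained with batch gradient descent. It is an illustration under stated assumptions, not the repository's code: the function names (`train_logistic`, `predict`), the synthetic two-blob data, and the hyperparameters are all made up for this example.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, alpha=0.5, num_iterations=2000):
    """Fit binary logistic regression by batch gradient descent (hypothetical helper)."""
    m = X.shape[0]
    X_b = np.hstack([np.ones((m, 1)), X])      # prepend a bias column of ones
    theta = np.zeros((X_b.shape[1], 1))
    for _ in range(num_iterations):
        predictions = sigmoid(X_b @ theta)     # predicted probabilities
        # Gradient of the cross-entropy cost with respect to theta
        theta -= (alpha / m) * (X_b.T @ (predictions - y))
    return theta

def predict(X, theta):
    """Return 0/1 class labels using a 0.5 probability threshold."""
    X_b = np.hstack([np.ones((X.shape[0], 1)), X])
    return (sigmoid(X_b @ theta) >= 0.5).astype(int)

# Synthetic, well-separated data (an assumption made for this sketch)
rng = np.random.default_rng(1)
class0 = rng.normal(loc=[-2, -2], scale=0.5, size=(50, 2))
class1 = rng.normal(loc=[2, 2], scale=0.5, size=(50, 2))
X = np.vstack([class0, class1])
y = np.vstack([np.zeros((50, 1)), np.ones((50, 1))])

theta = train_logistic(X, y)
accuracy = float((predict(X, theta) == y).mean())
print(f"Training accuracy: {accuracy:.2%}")
```

A non-linear boundary can be obtained with the same training loop by feeding in engineered features (for example, polynomial terms) instead of the raw inputs.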
Unsupervised Learning
Clustering (grouping similar data):
- K-means: partition data into K clusters
  - Iterative centroid refinement
  - Demo: Iris flower clustering
  - Use cases: Market segmentation, image compression, data analysis
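The "iterative centroid refinement" loop can be sketched in a few lines of NumPy. This is a simplified illustration rather than the repository's implementation; the `k_means` function, its parameters, and the synthetic blobs are assumptions for the example.

```python
import numpy as np

def k_means(X, k, num_iterations=50, seed=0):
    """Plain K-means: alternate between assigning points and moving centroids."""
    rng = np.random.default_rng(seed)
    # Initialise centroids with k distinct data points chosen at random
    centroids = X[rng.choice(X.shape[0], size=k, replace=False)]
    for _ in range(num_iterations):
        # Distance from every point to every centroid, shape (m, k)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Move each centroid to the mean of its points (keep it if the cluster is empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break                               # assignments have stabilised
        centroids = new_centroids
    return centroids, labels

# Two synthetic blobs (made up for this sketch)
rng = np.random.default_rng(5)
blob_a = rng.normal(loc=0.0, scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=10.0, scale=0.5, size=(50, 2))
X = np.vstack([blob_a, blob_b])

centroids, labels = k_means(X, k=2)
print(centroids)
```

K-means is sensitive to initialisation on harder data, so real use typically repeats the procedure from several random starts and keeps the best result.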
Anomaly Detection (identifying outliers):
- Gaussian Distribution: statistical anomaly detection
  - Model normal behavior with Gaussian distribution
  - Flag rare events based on probability threshold
  - Demo: Server monitoring (latency, throughput)
  - Use cases: Fraud detection, intrusion detection, system health
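As a rough sketch of the approach described above, the following fits an independent Gaussian to each feature and flags points whose density falls below a threshold. The helper names, the synthetic "latency/throughput" values, and the threshold `epsilon` are assumptions chosen for illustration; in practice the threshold is usually tuned on labelled validation data.

```python
import numpy as np

def fit_gaussian(X):
    """Estimate a per-feature mean and variance from (assumed normal) training data."""
    return X.mean(axis=0), X.var(axis=0)

def density(X, mu, sigma2):
    """Product of independent univariate Gaussian densities, one value per row."""
    coeff = 1.0 / np.sqrt(2 * np.pi * sigma2)
    exponent = -((X - mu) ** 2) / (2 * sigma2)
    return np.prod(coeff * np.exp(exponent), axis=1)

# Synthetic "normal" server behaviour: two features (values are made up)
rng = np.random.default_rng(2)
normal_behaviour = rng.normal(loc=[14.0, 15.0], scale=[1.0, 1.0], size=(300, 2))
mu, sigma2 = fit_gaussian(normal_behaviour)

epsilon = 1e-4                                  # hypothetical probability threshold
queries = np.array([[14.0, 15.0],               # close to typical behaviour
                    [25.0, 5.0]])               # far from anything seen in training
is_anomaly = density(queries, mu, sigma2) < epsilon
print(is_anomaly)
```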
Neural Networks
- Multilayer Perceptron (MLP): feedforward neural network
  - Multiple hidden layers with activation functions
  - Backpropagation for training
  - Demos: Handwritten digit recognition, clothing classification
  - Use cases: General-purpose ML, image recognition, voice recognition
- Step 7
Example: Linear regression walkthrough
Let's walk through the univariate linear regression example to understand the structure.
Implementation Location:
homemade/linear_regression/linear_regression.py
This file contains:
- The LinearRegression class
- Hypothesis function (linear equation)
- Cost function (mean squared error)
- Gradient descent optimization
- Prediction method
Demo Notebook:
notebooks/linear_regression/univariate_linear_regression_demo.ipynb
What the Demo Does:
- Loads a dataset (country happiness scores vs GDP)
- Visualizes the raw data as a scatter plot
- Trains a linear regression model
- Plots the regression line through the data
- Shows cost function convergence over iterations
- Makes predictions on new data points
Key Learning Points:
- See gradient descent in action (cost decreasing)
- Understand how learning rate affects convergence
- Experiment with different features (polynomial, etc.)
- Visualize overfitting vs underfitting
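The first two learning points can be reproduced outside the notebook with a short NumPy sketch of batch gradient descent on a mean-squared-error cost. This is not the repository's `LinearRegression` class; the `gradient_descent` function, the synthetic data, and the hyperparameter values are assumptions made for the example.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, num_iterations=500):
    """Fit y ≈ [1, X] @ theta by batch gradient descent; also return the cost history."""
    m = X.shape[0]
    X_b = np.hstack([np.ones((m, 1)), X])       # prepend a bias column of ones
    theta = np.zeros((X_b.shape[1], 1))
    cost_history = []
    for _ in range(num_iterations):
        error = X_b @ theta - y                  # hypothesis minus target
        cost_history.append(float((error ** 2).sum() / (2 * m)))
        theta -= (alpha / m) * (X_b.T @ error)   # step against the gradient
    return theta, cost_history

# Synthetic data following y = 2 + 3x plus a little noise (made up for this sketch)
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(100, 1))
y = 2 + 3 * X + rng.normal(0, 0.1, size=(100, 1))

theta, cost_history = gradient_descent(X, y, alpha=0.5, num_iterations=2000)
print("theta:", theta.ravel())
print("first cost:", cost_history[0], "last cost:", cost_history[-1])
```

Re-running with a much smaller `alpha` shows slower convergence in `cost_history`, while a value that is too large makes the cost grow instead of shrink.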
Try It Yourself:

    jupyter notebook notebooks/linear_regression/univariate_linear_regression_demo.ipynb

Modify the learning rate, iterations, or add polynomial features to see how the model changes.

    # Example from the implementation
    import pandas as pd
    from homemade.linear_regression import LinearRegression

    # Load data (GDP vs Happiness). The CSV contains a header row and a
    # country-name column, so pandas is used here instead of np.loadtxt.
    # The column names below are assumptions; check the file header.
    data = pd.read_csv('data/world-happiness-report-2017.csv')
    X = data[['Economy..GDP.per.Capita.']].values  # GDP column
    y = data[['Happiness.Score']].values           # Happiness column

    # Train model
    model = LinearRegression(X, y)
    model.train(alpha=0.01, num_iterations=500)

    # Make predictions
    predictions = model.predict(X)

    # Visualize results (see notebook for full plotting code)

- Step 8
Example: Neural network MNIST demo
The neural network implementation showcases a more advanced algorithm with the classic MNIST handwritten digit recognition task.
Implementation:
homemade/neural_network/multilayer_perceptron.py
Features:
- Configurable layer architecture (input → hidden → output)
- Sigmoid activation functions
- Backpropagation for weight updates
- Mini-batch gradient descent
- Regularization support
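To show what backpropagation computes, here is a small sketch of a one-hidden-layer sigmoid network with its gradients checked against finite differences. It is a teaching sketch and not the repository's `MultilayerPerceptron`: biases and regularization are left out for brevity, and every name and shape below is an assumption for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, W2):
    """Forward pass: input -> hidden (sigmoid) -> output (sigmoid)."""
    a1 = sigmoid(X @ W1)
    a2 = sigmoid(a1 @ W2)
    return a1, a2

def cost(X, Y, W1, W2):
    """Mean cross-entropy over all examples (summed over output units)."""
    _, a2 = forward(X, W1, W2)
    return float(-(Y * np.log(a2) + (1 - Y) * np.log(1 - a2)).sum() / X.shape[0])

def backprop(X, Y, W1, W2):
    """Gradients of the cost with respect to W1 and W2 via the chain rule."""
    m = X.shape[0]
    a1, a2 = forward(X, W1, W2)
    delta2 = a2 - Y                              # output error (sigmoid + cross-entropy)
    grad_W2 = a1.T @ delta2 / m
    delta1 = (delta2 @ W2.T) * a1 * (1 - a1)     # propagate error through the hidden layer
    grad_W1 = X.T @ delta1 / m
    return grad_W1, grad_W2

def numerical_gradient(f, W, eps=1e-5):
    """Central finite differences; perturbs W in place and restores each entry."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        original = W[idx]
        W[idx] = original + eps
        plus = f()
        W[idx] = original - eps
        minus = f()
        W[idx] = original
        grad[idx] = (plus - minus) / (2 * eps)
    return grad

# Tiny random problem: 5 examples, 4 inputs, 6 hidden units, 3 classes
rng = np.random.default_rng(3)
X = rng.normal(size=(5, 4))
Y = np.eye(3)[rng.integers(0, 3, size=5)]        # one-hot labels
W1 = rng.normal(scale=0.5, size=(4, 6))
W2 = rng.normal(scale=0.5, size=(6, 3))

grad_W1, grad_W2 = backprop(X, Y, W1, W2)
num_W1 = numerical_gradient(lambda: cost(X, Y, W1, W2), W1)
num_W2 = numerical_gradient(lambda: cost(X, Y, W1, W2), W2)
max_diff = max(np.abs(grad_W1 - num_W1).max(), np.abs(grad_W2 - num_W2).max())
print("largest gradient mismatch:", max_diff)
```

A gradient check like this is slow, so it is typically run once on a tiny network to validate the backpropagation code and then switched off for training.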
Demo Notebook:
notebooks/neural_network/multilayer_perceptron_demo.ipynb
Dataset:
- 60,000 training images (28×28 pixels)
- 10,000 test images
- 10 digit classes (0-9)
- Each image flattened to 784 features
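The flattening step, and the one-hot label encoding that commonly accompanies it, can be sketched as follows. The random array stands in for real images, and scaling pixel values to [0, 1] is a common preprocessing choice assumed here rather than something taken from the notebook.

```python
import numpy as np

# Stand-in for a small batch of 28×28 grayscale images (random values, not MNIST)
rng = np.random.default_rng(4)
images = rng.integers(0, 256, size=(6, 28, 28))
labels = np.array([3, 0, 9, 1, 3, 7])

# Flatten each image to a 784-element row and scale pixels to [0, 1]
X = images.reshape(images.shape[0], -1) / 255.0

# One-hot encode the labels: row i has a single 1 in column labels[i]
Y = np.eye(10)[labels]

print(X.shape, Y.shape)
```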
What You'll Learn:
- How neural networks transform data through layers
- Impact of hidden layer size on accuracy
- Training progress visualization (accuracy over epochs)
- Overfitting detection
- Confusion matrix interpretation
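For the last point, a confusion matrix can be built directly with NumPy. The sketch below uses a hypothetical `confusion_matrix` helper and made-up labels; rows are true classes and columns are predicted classes, so off-diagonal entries show which classes get mistaken for which.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """Count how often each true class (row) is predicted as each class (column)."""
    matrix = np.zeros((num_classes, num_classes), dtype=int)
    np.add.at(matrix, (y_true, y_pred), 1)       # accumulates repeated index pairs
    return matrix

# Made-up labels for three classes
y_true = np.array([0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0, 2])

matrix = confusion_matrix(y_true, y_pred, num_classes=3)
accuracy = np.trace(matrix) / matrix.sum()       # correct predictions sit on the diagonal
print(matrix)
print(f"accuracy: {accuracy:.3f}")
```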
Typical Results:
- Training accuracy: ~95-97%
- Test accuracy: ~93-95%
- Training time: 5-10 minutes (CPU)
Experimentation Ideas:
- Add more hidden layers
- Change layer sizes (128 → 256 neurons)
- Adjust learning rate
- Enable/disable regularization
- Compare with Fashion MNIST dataset

    # Neural network configuration example
    from homemade.neural_network import MultilayerPerceptron

    # X_train, y_train, X_test, y_test are assumed to be loaded already
    # (see the demo notebook for the data-loading code).

    # Network architecture
    layers = [
        784,  # Input: 28×28 pixels flattened
        128,  # Hidden layer: 128 neurons
        10    # Output: 10 digit classes
    ]

    # Train model
    model = MultilayerPerceptron(X_train, y_train, layers)
    model.train(
        alpha=0.1,           # Learning rate
        lambda_param=0.0,    # Regularization
        num_iterations=500,  # Epochs
        batch_size=100       # Mini-batch size
    )

    # Evaluate
    accuracy = model.evaluate(X_test, y_test)
    print(f'Test accuracy: {accuracy:.2%}')

- Step 9
Educational approach and learning path
Homemade Machine Learning follows a pedagogical progression from simple to complex algorithms.
Recommended Learning Path:
- Start with Linear Regression (Easiest)
  - Univariate demo first (single feature)
  - Then multivariate (multiple features)
  - Finally non-linear (feature engineering)
  - Builds intuition for cost functions and gradient descent
- Move to Logistic Regression
  - Linear boundary demo (natural extension of linear regression)
  - Non-linear boundary (feature engineering revisited)
  - Multivariate MNIST (high-dimensional classification)
- Explore Unsupervised Learning
  - K-means clustering (simpler than classification)
  - Anomaly detection (introduces probability distributions)
- Tackle Neural Networks (Most Complex)
  - Builds on all previous concepts
  - Combines gradient descent, classification, and feature learning
  - MNIST provides concrete benchmark
Mathematical Prerequisites:
- Linear algebra (matrices, vectors, dot products)
- Calculus (derivatives, partial derivatives, chain rule)
- Basic probability and statistics
- Understanding of cost functions and optimization
Most Examples Reference: The code and explanations are based on Andrew Ng's Machine Learning course (Coursera), making it easy to cross-reference with video lectures.
Learning Progression:
1. Linear Regression
   └─ Univariate → Multivariate → Non-linear
2. Logistic Regression
   └─ Linear boundary → Non-linear → MNIST
3. Unsupervised Learning
   ├─ K-means clustering
   └─ Anomaly detection
4. Neural Networks
   └─ MLP → MNIST → Fashion MNIST
- Step 10
Datasets included
The repository includes several real-world datasets in the data/ folder:
Regression Datasets:
- World Happiness Report 2017 — Country happiness scores with economic indicators (GDP, freedom, generosity, etc.)
- Used for: Linear regression demos
- Features: Economy GDP, social support, life expectancy, freedom
- Target: Happiness score
Classification Datasets:
- Iris Flower Dataset: classic ML dataset with 3 flower species
  - 150 samples, 4 features (sepal/petal length and width)
  - Used for: Logistic regression, K-means clustering
- MNIST Handwritten Digits: 70,000 grayscale images (60k train, 10k test)
  - 28×28 pixels per image
  - 10 classes (digits 0-9)
  - Used for: Logistic regression, neural networks
- Fashion MNIST: alternative to MNIST with clothing items
  - Same format as MNIST (28×28 grayscale)
  - 10 classes (t-shirt, trouser, dress, coat, sandal, etc.)
  - Used for: Logistic regression, neural networks
Anomaly Detection:
- Server Metrics — Synthetic dataset of server operational parameters
- Features: Latency, throughput
- Contains normal and anomalous behavior examples
All datasets are loaded via NumPy or Pandas and include preprocessing examples in the notebooks.

    # Example dataset loading patterns
    import numpy as np

    # CSV loading with NumPy. np.loadtxt needs an all-numeric file; for CSVs
    # with a header row or text columns, use pandas.read_csv instead.
    data = np.loadtxt('data/happiness.csv', delimiter=',')
    X = data[:, 0:2]  # Features
    y = data[:, 2:3]  # Target

    # Iris dataset via sklearn
    # (requires scikit-learn, which is not listed in requirements.txt)
    from sklearn import datasets
    iris = datasets.load_iris()
    X = iris.data[:, :2]  # First 2 features for visualization
    y = iris.target

    # MNIST via keras datasets
    # (requires Keras, which is not listed in requirements.txt)
    from keras.datasets import mnist
    (X_train, y_train), (X_test, y_test) = mnist.load_data()

- Step 11
Code structure and utilities
The homemade/utils/ directory contains shared utilities used across multiple algorithms:
Features Module (features/):
- prepare_for_training(): Normalize data and add bias column
- normalize(): Feature scaling (zero mean, unit variance)
- generate_polynomials(): Create polynomial features for non-linear regression
- generate_sinusoids(): Create sinusoidal features
Hypothesis Module:
- linear_hypothesis(): Linear prediction function
- sigmoid(): Logistic activation function
Cost Functions:
- Mean Squared Error (regression)
- Cross-Entropy Loss (classification)
- Regularization terms
Optimization:
- Gradient descent implementation
- Mini-batch gradient descent
- Learning rate scheduling helpers
Plotting Utilities:
- Decision boundary visualization
- Cost function convergence plots
- Confusion matrices
- 3D surface plots for regression
Why This Matters: Understanding these utilities is crucial because they reveal the common patterns across all ML algorithms (feature scaling, cost computation, gradient calculation). The main algorithm classes focus on the unique aspects while delegating these shared concerns to utilities.
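The two most common shared steps, feature scaling and adding a bias column, can be sketched in plain NumPy as follows. The helpers `zscore_normalize` and `add_bias_column` are hypothetical stand-ins written for this example; they are not the functions in homemade/utils/ and may differ from them in details such as return values.

```python
import numpy as np

def zscore_normalize(features):
    """Scale each column to zero mean and unit variance; return the statistics too."""
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    std[std == 0] = 1.0                          # avoid dividing by zero for constant columns
    return (features - mean) / std, mean, std

def add_bias_column(features):
    """Prepend a column of ones so theta[0] acts as the intercept."""
    return np.hstack([np.ones((features.shape[0], 1)), features])

# Two features on very different scales (values made up for this sketch)
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

X_norm, mean, std = zscore_normalize(X)
X_ready = add_bias_column(X_norm)
print(X_ready)
```

Keeping `mean` and `std` matters because new data must be scaled with the training statistics before prediction, not with its own.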

    # Example utility usage
    from homemade.utils.features import prepare_for_training

    # Normalize features and add bias
    X_normalized, features_mean, features_std = prepare_for_training(X)

    # Generate polynomial features (degree 2)
    from homemade.utils.features import generate_polynomials
    X_poly = generate_polynomials(X, polynomial_degree=2)

    # Common pattern in all algorithms:
    # 1. Prepare features (normalize + bias)
    # 2. Initialize parameters (theta)
    # 3. Compute cost and gradients
    # 4. Update parameters via gradient descent
    # 5. Repeat until convergence

- Step 12
Development and testing
The repository includes development tooling for code quality and testing.
Linting with Pylint: The project uses Pylint with a custom configuration (pylintrc) to maintain code quality.

    # Run linter on all implementations
    pylint ./homemade

The pylintrc file contains project-specific rules and is used in CI.
Continuous Integration: Travis CI automatically runs linting on every commit.
Configuration (.travis.yml):
- Python 3.6 environment
- Installs dependencies from requirements.txt
- Runs pylint ./homemade
- Email notifications disabled
Testing Approach: While the repository doesn't include formal unit tests (pytest suite), testing happens through:
- Interactive notebook execution (visual validation)
- Algorithm convergence verification
- Accuracy metrics on known datasets
- Comparison with expected results from Andrew Ng's course
Contributing Guidelines: See CONTRIBUTING.md for guidelines on:
- Code style and formatting
- Adding new algorithms
- Improving documentation
- Submitting issues and pull requests

    # Development workflow

    # 1. Install dev dependencies
    pip install -r requirements.txt

    # 2. Make changes to algorithm implementations
    vim homemade/linear_regression/linear_regression.py

    # 3. Test via notebook
    jupyter notebook notebooks/linear_regression/...

    # 4. Run linter
    pylint ./homemade

    # 5. Commit (Travis CI will lint automatically)
    git add .
    git commit -m "Improve gradient descent convergence"
    git push

- Step 13
Related projects and alternatives
Oleksii Trekhleb (@trekhleb) maintains several related educational ML projects:
Other Projects by the Same Author:
- machine-learning-octave: Octave/MATLAB version of this repository
  - GitHub: https://github.com/trekhleb/machine-learning-octave
  - Uses Octave instead of Python
  - Follows the same educational approach
  - Direct implementations from Andrew Ng's course
- Interactive Machine Learning Experiments: web-based ML playground
  - GitHub: https://github.com/trekhleb/machine-learning-experiments
  - Live demos in the browser
  - Uses TensorFlow.js
  - More visual, less mathematical
- Homemade GPT (JavaScript): GPT implementation from scratch
  - GitHub: https://github.com/trekhleb/homemade-gpt-js
  - Focus on transformer architecture
  - TypeScript/JavaScript implementation
When to Use Homemade Machine Learning:
- Learning ML fundamentals from scratch
- Understanding mathematical foundations
- Preparing for ML interviews
- Teaching ML concepts
- Transitioning from theory (courses) to code
When to Use Production Libraries Instead:
- Building production ML systems
- Working with large datasets
- Deploying models to production
- Time-critical development
- Advanced deep learning architectures
Production ML Libraries:
- scikit-learn (classical ML)
- TensorFlow / PyTorch (deep learning)
- XGBoost / LightGBM (gradient boosting)
- Keras (high-level neural networks)
- Step 14
Resources and community
Official Resources:
- GitHub Repository: https://github.com/trekhleb/homemade-machine-learning
- Author: Oleksii Trekhleb (@trekhleb)
- License: MIT License (open source, commercial use allowed)
- Stars: about 23,000 (as of 2024)
Learning Resources:
- Andrew Ng's ML Course: https://www.coursera.org/learn/machine-learning
  - Free course on Coursera
  - Most algorithms in this repo are based on this course
  - Highly recommended companion resource
- NBViewer Links: Embedded in README for each algorithm
  - Fast read-only preview of notebooks
  - No installation required
- Binder: Interactive notebook execution
  - Full Jupyter environment in browser
  - Click "Execute on Binder" in NBViewer
Community Support:
- GitHub Issues: Bug reports and feature requests
- GitHub Discussions: Questions and community help
- Pull Requests: Contributions welcome (see CONTRIBUTING.md)
- Code of Conduct: See CODE_OF_CONDUCT.md
Supporting the Project:
- GitHub Sponsors: https://github.com/sponsors/trekhleb
- Patreon: https://www.patreon.com/trekhleb
The project is actively maintained with regular updates and improvements.
Quick Links:
Repository: https://github.com/trekhleb/homemade-machine-learning
Author: @trekhleb
License: MIT
Stars: 23K+
Learning:
├─ Andrew Ng Course: coursera.org/learn/machine-learning
├─ NBViewer: Read-only notebook previews
└─ Binder: Run notebooks in browser
Support:
├─ GitHub Issues (bugs)
├─ GitHub Discussions (questions)
└─ Sponsors / Patreon (funding)