In real-life machine learning (ML) projects, modular programming and version control using Git/GitHub are critical for building scalable, maintainable, and collaborative codebases.
Let’s break it down step by step, and then I’ll show you a simple project structure with modular code and explain how it would be used on GitHub.
🧱 1. What is Modular Programming in ML Projects?
Instead of writing one giant Jupyter notebook, modular programming means:
- Splitting your project into logical modules/files
- Each module has a specific role: data processing, model training, prediction, etc.
- Makes the code reusable, testable, and cleaner
🗂️ 2. Typical Project Folder Structure
ml-project/
│
├── data/                # Raw or processed data
│   └── train.csv
│
├── notebooks/           # Jupyter notebooks for exploration or reports
│   └── EDA.ipynb
│
├── src/                 # Source code
│   ├── __init__.py
│   ├── data_loader.py   # Load and preprocess data
│   ├── model.py         # Train and evaluate model
│   ├── predict.py       # Make predictions with a saved model
│   └── utils.py         # Utility functions
│
├── models/              # Saved models
│   └── model.pkl
│
├── main.py              # Run the pipeline end-to-end
├── requirements.txt     # Project dependencies
└── README.md            # Project overview
🧑💻 3. Sample Modular Code: Titanic Prediction (Simplified)
🔹 src/data_loader.py
import pandas as pd
from sklearn.model_selection import train_test_split

def load_data(path):
    df = pd.read_csv(path)
    features = ['Pclass', 'Age', 'Fare']
    # Drop rows with missing values only in the columns we actually use,
    # rather than dropping any row with a NaN anywhere in the dataset
    df = df.dropna(subset=features + ['Survived'])
    X = df[features]
    y = df['Survived']
    return train_test_split(X, y, test_size=0.2, random_state=42)
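To see what `load_data` hands back, here is a standalone sketch of the same split logic on a tiny synthetic DataFrame (the passenger values below are made up for illustration, not real Titanic data):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for data/train.csv (values are made up)
df = pd.DataFrame({
    'Pclass':   [1, 2, 3, 1, 3, 2, 1, 3, 2, 3],
    'Age':      [22, 38, 26, 35, 28, 19, 40, 31, 24, 45],
    'Fare':     [71.3, 8.1, 7.9, 53.1, 8.5, 13.0, 83.2, 7.8, 26.0, 8.7],
    'Survived': [1, 1, 0, 1, 0, 1, 1, 0, 1, 0],
})

X = df[['Pclass', 'Age', 'Fare']]
y = df['Survived']

# test_size=0.2 on 10 rows -> 8 training rows, 2 test rows;
# random_state=42 makes the split reproducible across runs
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(len(X_train), len(X_test))  # 8 2
```

The fixed `random_state` is what makes experiments comparable between runs and between collaborators.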
🔹 src/model.py
import os

import joblib
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def train_model(X_train, y_train):
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    os.makedirs('models', exist_ok=True)  # make sure models/ exists before saving
    joblib.dump(model, 'models/model.pkl')
    return model

def evaluate_model(model, X_test, y_test):
    preds = model.predict(X_test)
    return accuracy_score(y_test, preds)
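The folder structure above lists `src/predict.py`, which hasn't been shown yet. A minimal sketch (assuming the model was saved to `models/model.pkl` by `train_model` above) could look like this:

```python
import joblib
import pandas as pd

def predict(passengers, model_path='models/model.pkl'):
    """Load a saved model and predict survival for new passengers.

    `passengers` is a list of [Pclass, Age, Fare] rows; the column
    order must match what the model was trained on.
    """
    model = joblib.load(model_path)
    X = pd.DataFrame(passengers, columns=['Pclass', 'Age', 'Fare'])
    return model.predict(X)
```

Keeping prediction in its own module means a deployed service can import `predict` without pulling in any training code.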
🔹 main.py
from src.data_loader import load_data
from src.model import train_model, evaluate_model

def run():
    X_train, X_test, y_train, y_test = load_data('data/train.csv')
    model = train_model(X_train, y_train)
    accuracy = evaluate_model(model, X_test, y_test)
    print(f'Model Accuracy: {accuracy:.2f}')

if __name__ == '__main__':
    run()
🔹 requirements.txt
pandas
scikit-learn
joblib
🌐 4. Using GitHub
- Initialize Git:
git init
git add .
git commit -m "Initial commit with modular Titanic ML project"
- Push to GitHub:
git remote add origin https://github.com/yourusername/titanic-ml.git
git push -u origin main
- Best Practices on GitHub:
  - Write a clear README.md with the project goal, setup instructions, and example commands.
  - Use .gitignore to exclude large files such as datasets or .pkl models.
  - Commit frequently with clear messages.
  - Use branches for experimentation (git checkout -b feature-model-tuning).
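For this layout, a starting .gitignore might look like the following (the patterns are illustrative; adjust them to what your repo actually needs to track):

```
# Data and model artifacts
data/
models/*.pkl

# Python cruft
__pycache__/
*.pyc
.ipynb_checkpoints/

# Virtual environments
venv/
.env
```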
🧠 Why This Approach?
- ✅ Easier to debug and update.
- ✅ Collaborators can work on separate modules without conflicts.
- ✅ GitHub enables version control, code reviews, and issue tracking.