MLOps
December 4, 2023
Siddhartha Vemuganti
Data Engineering & Cloud Architecture
Introduction
This guide walks you through the foundational steps of building a robust MLOps pipeline using a combination of modern tools suited to data science projects, covering practical steps from data management to production deployment and monitoring.
Data Management with DVC (Data Version Control)
Getting Started with DVC
Installation and Configuration
pip install dvc
dvc init
This sets up DVC in your project, initializing a .dvc directory to track data versions.
Basic Commands
dvc add data/dataset.csv
git add data/dataset.csv.dvc data/.gitignore
git commit -m "Track dataset with DVC"
dvc push
These commands track your datasets, commit data pointers to Git, and push the actual data to DVC storage.
Common Gotchas
- Ensure you have the necessary remote storage set up for DVC (e.g., S3, GCS).
- Check .dvc/config to confirm the remote is configured correctly; a minimal example follows this list.
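For example, configuring an S3 remote as the default and pushing data to it (the bucket name here is a placeholder):
dvc remote add -d storage s3://my-bucket/dvcstore
dvc push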
Advanced Usage: Automating Pipelines
- Creating Pipelines: The commands below create a reproducible stage for preparing data, tracking the script and its outputs automatically. (Note: dvc run is deprecated in recent DVC releases; dvc stage add followed by dvc repro is the current equivalent.)
dvc stage add -n prepare -d src/prepare.py -o data/prepared.csv python src/prepare.py
dvc repro
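This records the stage in dvc.yaml, roughly as follows (a sketch; the exact file DVC generates may differ slightly):
stages:
  prepare:
    cmd: python src/prepare.py
    deps:
      - src/prepare.py
    outs:
      - data/prepared.csv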
Experiment Tracking with MLflow
Setting Up MLflow
Installation
pip install mlflow
mlflow ui
These commands install MLflow and start the MLflow UI for tracking experiments.
Logging Experiments
import mlflow

mlflow.start_run()
mlflow.log_param("num_trees", 100)
mlflow.log_metric("accuracy", 0.95)
mlflow.end_run()
Logs experiment parameters and metrics, essential for comparing different model versions.
Integration with DVC
- Linking Data and Experiments: Use DVC to version the data and MLflow to track the experiments run against it, recording which data revision each run used, as sketched below.
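One lightweight way to link the two, assuming the dataset's .dvc pointer file is committed to Git, is to tag each MLflow run with the Git commit that pins the data version (the tag and parameter names here are illustrative):

import subprocess

import mlflow

# The Git commit pins the .dvc pointer file, and therefore the data version
data_rev = subprocess.check_output(["git", "rev-parse", "HEAD"]).decode().strip()

with mlflow.start_run():
    mlflow.set_tag("data_git_rev", data_rev)
    mlflow.log_param("dataset", "data/dataset.csv")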
Model Training
Model training is a core component of building machine learning systems. It involves developing models that can learn from data to make predictions or decisions without being explicitly programmed to perform the task.
Steps in Model Training
- Data Preparation: Splitting data into training, validation, and test sets.
- Model Selection: Choosing the right model based on the problem type (e.g., regression, classification).
- Model Training: Actual process of teaching the model using training data.
- Evaluation: Assessing model performance using validation and test sets.
Parameter Tuning and Overfitting
- Cross-Validation: Used to ensure that the model generalizes well to new data.
- Regularization Techniques: Methods like L1 and L2 regularization penalize large weights to prevent overfitting; a short scikit-learn sketch combining both ideas follows this list.
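A minimal sketch combining cross-validation with L1 and L2 regularization (placeholder data; the alpha values are illustrative):

import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import cross_val_score

# Placeholder data: replace with your own features and targets
X, y = np.random.rand(100, 10), np.random.rand(100)

# L2-regularized (Ridge) model scored with 5-fold cross-validation
ridge_scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=5)

# L1-regularized (Lasso) alternative
lasso_scores = cross_val_score(Lasso(alpha=0.1), X, y, cv=5)

print(f"Ridge R^2: {ridge_scores.mean():.3f}, Lasso R^2: {lasso_scores.mean():.3f}")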
Example Workflows
- Using TensorFlow for Classification
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Placeholder data: replace with your own features and binary labels
x_train, y_train = np.random.rand(100, 10), np.random.randint(0, 2, 100)
x_val, y_val = np.random.rand(20, 10), np.random.randint(0, 2, 20)

# Model definition: a small binary classifier
model = Sequential([
    Dense(32, activation='relu', input_shape=(10,)),
    Dense(1, activation='sigmoid')
])

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the model, monitoring performance on the validation set
model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=10)
- Using Scikit-Learn for Regression
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Placeholder data: replace with your own features and targets
X, y = np.random.rand(100, 5), np.random.rand(100)

# Data preparation
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Model training
model = LinearRegression()
model.fit(X_train, y_train)

# Evaluation
predictions = model.predict(X_val)
mse = mean_squared_error(y_val, predictions)
print(f'Mean Squared Error: {mse}')
API Deployment with Flask on Heroku
Installation and Setup
pip install Flask gunicorn
Gunicorn is used as the WSGI server for running Flask applications in production.
Example Flask Application for Predictions
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json()
    # Prediction logic here: placeholder until a real model is wired in
    result = sum(data.get("features", []))
    return jsonify({"prediction": result})

if __name__ == '__main__':
    app.run()
This Flask application provides a simple API endpoint for making predictions.
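You can exercise the endpoint locally with curl (port 5000 is Flask's development default; the payload matches the placeholder logic above):

curl -X POST http://localhost:5000/predict \
  -H "Content-Type: application/json" \
  -d '{"features": [1.0, 2.0, 3.0]}'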
Deploying to Heroku
- Add a Procfile (a minimal example follows this list) and scale your dynos appropriately.
- Monitor logs via heroku logs --tail.
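A minimal Procfile, assuming the Flask app object above lives in app.py:

web: gunicorn app:app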
Common Gotchas
- Timeout issues: Heroku's router cancels requests after 30 seconds, so keep inference fast or push long-running work to background jobs.
- Memory limits: Optimize model loading and inference to stay within your dyno's memory quota. One way to address both concerns in the Procfile is shown below.
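A tuned variant of the Procfile (flag values are illustrative): --preload loads the model once in the master process before workers fork, and --timeout keeps gunicorn's worker timeout under Heroku's 30-second router limit.

web: gunicorn app:app --workers 2 --timeout 25 --preload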
Continuous Integration and Deployment with GitHub Actions
Continuous Integration and Deployment are critical components of modern software development practices, especially in MLOps, enabling faster and more reliable delivery of applications. GitHub Actions provides a powerful platform to automate these workflows directly within your GitHub repository.
Setup CI/CD Pipeline
GitHub Actions Configuration
To automate the deployment of a Flask application to Heroku, you need to create a GitHub Actions workflow. This workflow will handle installation, testing, build processes, and deployment upon every commit to the main branch of your repository.
Here’s a step-by-step guide and example workflow:
Create a Workflow File: In your GitHub repository, create a directory named .github/workflows if it doesn't already exist. Inside this directory, create a YAML file for your workflow, e.g., flask_ci_cd.yml.
Define Workflow Configuration:
name: Flask CI/CD Pipeline

on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: '3.8'
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt
      - name: Run tests
        run: |
          # Add commands to run your tests here
          pytest
      - name: Deploy to Heroku
        if: github.ref == 'refs/heads/main' && github.event_name == 'push'
        uses: akhileshns/heroku-deploy@v3.12.12
        with:
          heroku_api_key: ${{ secrets.HEROKU_API_KEY }}
          heroku_app_name: "your-app-name" # Replace with your Heroku app's name
          heroku_email: "your-email@example.com" # Replace with your Heroku account email
Model Monitoring and Evaluation with Evidently AI
Setting Up Evidently AI
- Installation
pip install evidently
This tool helps monitor model performance and detect data drift in your production environment.
Integration Example
- Monitoring Dashboard: Create dashboards and reports to visualize model performance metrics and data quality; a minimal drift-report sketch follows.
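A minimal sketch using Evidently's Report API (as of the 0.4.x releases; the file paths are placeholders for your own reference and production data):

import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

# Reference data from training time vs. recent production data (placeholder paths)
ref_df = pd.read_csv("data/reference.csv")
cur_df = pd.read_csv("data/current.csv")

# Build a data-drift report and save it as a shareable HTML dashboard
report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=ref_df, current_data=cur_df)
report.save_html("drift_report.html")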
Conclusion
This guide provides a comprehensive overview of building and maintaining an MLOps pipeline with a focus on practical implementation and troubleshooting common issues.
Tool-Specific Gotchas
When working with tools in MLOps, it's important to be aware of common pitfalls and how to avoid them. This section highlights specific gotchas for several tools commonly found in MLOps stacks and provides tips for troubleshooting.
Apache Spark
- Memory Management: Spark applications can face OutOfMemory errors if not configured properly. Monitor the storage and execution memory parameters closely.
- Shuffle Operations: Shuffles are costly. Minimize them by restructuring your transformations, and tune spark.sql.shuffle.partitions when shuffles are unavoidable (see the snippet below).
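For example, in PySpark the shuffle partition count can be set at runtime (the value here is illustrative; pick one suited to your data volume and cluster size):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tuning-demo").getOrCreate()

# Raise shuffle parallelism above the default of 200 for large joins/aggregations
spark.conf.set("spark.sql.shuffle.partitions", "400")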
TensorFlow
- GPU Memory Issues: Running TensorFlow on GPUs can lead to memory overflow. Enable memory growth per device with tf.config.experimental.set_memory_growth(gpu, True) (sketched below) to prevent TensorFlow from allocating all GPU memory upfront.
- Version Compatibility: TensorFlow's rapid development pace means new versions might not be fully compatible with old code. Always check the upgrade guides.
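A short sketch enabling memory growth for every visible GPU:

import tensorflow as tf

# Allocate GPU memory on demand instead of grabbing it all at startup
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)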
Scikit-Learn
- Data Scaling: Many Scikit-Learn estimators (e.g., SVMs and regularized linear models) are sensitive to feature scale. Scale your input features before training.
- Model Persistence: When saving and loading models, use joblib instead of pickle for efficiency with large NumPy arrays; a sketch combining scaling and persistence follows.
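A minimal sketch under placeholder data, baking scaling into a pipeline so inference inputs are transformed identically, then persisting with joblib:

import numpy as np
import joblib
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder data
X, y = np.random.rand(100, 5), np.random.rand(100)

# Scaling is part of the model, so it is saved and reloaded along with it
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X, y)

# joblib handles large NumPy arrays more efficiently than pickle
joblib.dump(model, "model.joblib")
loaded = joblib.load("model.joblib")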
These insights should help you avoid common issues and make better use of these powerful tools.