MLOps

December 4, 2023

Siddhartha Vemuganti

Data Engineering & Cloud Architecture

Introduction

This guide walks you through the foundational steps of building a robust MLOps pipeline with a set of widely used tools: DVC for data versioning, MLflow for experiment tracking, Flask and Heroku for model serving, GitHub Actions for CI/CD, and Evidently AI for monitoring. It aims to provide practical steps, from data management to production deployment and monitoring.

Data Management with DVC (Data Version Control)

Getting Started with DVC

  • Installation and Configuration

    pip install dvc
    dvc init
    

    This sets up DVC in your project, creating a .dvc/ directory that holds DVC's configuration and local cache.

  • Basic Commands

    dvc add data/dataset.csv
    git add data/dataset.csv.dvc data/.gitignore
    git commit -m "Track dataset with DVC"
    dvc push
    

    These commands place the dataset under DVC control, commit the lightweight .dvc pointer file to Git, and push the actual data to your configured remote storage.

  • Common Gotchas

    • Ensure you have configured a remote storage backend for DVC (e.g., S3, GCS) before running dvc push; a typical setup is shown below.
    • Check .dvc/config to confirm the remote is configured correctly.
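
    For example, adding an S3 bucket as the default remote (the bucket name here is illustrative):

    dvc remote add -d storage s3://my-bucket/dvc-store
    git add .dvc/config
    git commit -m "Configure DVC remote"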

Advanced Usage: Automating Pipelines

  • Creating Pipelines

    dvc stage add -n prepare -d src/prepare.py -o data/prepared.csv python src/prepare.py
    dvc repro
    

    This defines a reproducible prepare stage in dvc.yaml, tracking the script and its output; dvc repro executes the stage and re-runs it whenever a dependency changes. (Older DVC releases expose the same functionality as a single dvc run command.)

Experiment Tracking with MLflow

Setting Up MLflow

  • Installation

    pip install mlflow
    mlflow ui
    

    These commands install MLflow and start the tracking UI (by default at http://localhost:5000), where you can browse and compare experiment runs.

  • Logging Experiments

    import mlflow
    mlflow.start_run()
    mlflow.log_param("num_trees", 100)
    mlflow.log_metric("accuracy", 0.95)
    mlflow.end_run()
    

    Logs experiment parameters and metrics, essential for comparing different model versions.

Integration with DVC

  • Linking Data and Experiments: Use DVC to version the dataset and MLflow to track the experiments that consume it, so every run can be traced back to the exact data version it was trained on. One lightweight way to connect the two is sketched below.
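
    A minimal sketch: record the current Git commit (which pins the .dvc pointer files, and therefore the data version) and the dataset path as MLflow tags. The tag names and path are illustrative.

    import subprocess

    import mlflow

    # The Git commit pins the .dvc pointer files, and therefore the exact data version
    commit = subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()

    with mlflow.start_run():
        mlflow.set_tag("git_commit", commit)           # code + data pointer version
        mlflow.set_tag("dataset", "data/dataset.csv")  # DVC-tracked input (illustrative)
        mlflow.log_param("num_trees", 100)
        mlflow.log_metric("accuracy", 0.95)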

Model Training

Model training is a core component of building machine learning systems. It involves developing models that can learn from data to make predictions or decisions without being explicitly programmed to perform the task.

Steps in Model Training

  • Data Preparation: Splitting data into training, validation, and test sets.
  • Model Selection: Choosing the right model based on the problem type (e.g., regression, classification).
  • Model Training: Actual process of teaching the model using training data.
  • Evaluation: Assessing model performance using validation and test sets.

Parameter Tuning and Overfitting

  • Cross-Validation: Used to ensure that the model generalizes well to new data (a short scikit-learn sketch follows this list).
  • Regularization Techniques: Methods like L1 and L2 regularization to prevent overfitting.
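
To make these two ideas concrete, here is a minimal sketch (synthetic data stands in for your own) that evaluates an L2-regularized classifier with 5-fold cross-validation:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    # Synthetic data; substitute your own feature matrix X and labels y
    X, y = make_classification(n_samples=500, n_features=10, random_state=42)

    # L2-regularized logistic regression evaluated with 5-fold cross-validation
    model = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"Mean CV accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")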

Example Workflows

  • Using TensorFlow for Classification

    import tensorflow as tf
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense

    # Model definition: a small binary classifier for 10 input features
    model = Sequential([
        Dense(32, activation='relu', input_shape=(10,)),
        Dense(1, activation='sigmoid')
    ])

    # Compile the model
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

    # Train the model (x_train, y_train, x_val, y_val are your prepared datasets)
    model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=10)

  • Using Scikit-Learn for Regression

    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error

    # Data preparation (X is the feature matrix, y the target vector)
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

    # Model training
    model = LinearRegression()
    model.fit(X_train, y_train)

    # Evaluation
    predictions = model.predict(X_val)
    mse = mean_squared_error(y_val, predictions)
    print(f'Mean Squared Error: {mse}')

API Deployment with Flask on Heroku

  • Installation and Setup

    pip install Flask gunicorn
    

    Gunicorn is used as the WSGI server for running Flask applications in production.

  • Example Flask Application for Predictions

    from flask import Flask, request, jsonify

    app = Flask(__name__)

    @app.route('/predict', methods=['POST'])
    def predict():
        data = request.get_json()
        # Replace this placeholder with your model's inference logic,
        # e.g. result = model.predict([data["features"]])[0]
        result = 0.0
        return jsonify({"prediction": result})

    if __name__ == '__main__':
        app.run()
    

    This Flask application provides a simple API endpoint for making predictions.
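
    For a quick local test of the endpoint (with the placeholder logic above), post some JSON to it:

    curl -X POST -H "Content-Type: application/json" \
         -d '{"features": [1, 2, 3]}' \
         http://localhost:5000/predict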

  • Deploying to Heroku

    • Add a Procfile (a minimal example follows this list) and scale your dynos appropriately.
    • Monitor logs via heroku logs --tail.
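
    A minimal Procfile, assuming the Flask app object is named app and lives in app.py at the repository root:

    web: gunicorn app:app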

Common Gotchas

  • Timeout issues: Heroku's router enforces a 30-second request timeout, so keep inference fast (or offload long-running work to a background worker) and set a suitable gunicorn --timeout.
  • Memory limits: Optimize model loading and inference to handle Heroku’s memory constraints.

Continuous Integration and Deployment with GitHub Actions

Continuous Integration and Deployment are critical components of modern software development practices, especially in MLOps, enabling faster and more reliable delivery of applications. GitHub Actions provides a powerful platform to automate these workflows directly within your GitHub repository.

Setup CI/CD Pipeline

GitHub Actions Configuration

To automate the deployment of a Flask application to Heroku, you need to create a GitHub Actions workflow. This workflow will handle installation, testing, build processes, and deployment upon every commit to the main branch of your repository.

Here’s a step-by-step guide and example workflow:

  1. Create a Workflow File: In your GitHub repository, create a directory named .github/workflows if it doesn’t already exist. Inside this directory, create a YAML file for your workflow, e.g., flask_ci_cd.yml.

  2. Define Workflow Configuration:

name: Flask CI/CD Pipeline

on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]

jobs:
  build:
    runs-on: ubuntu-latest

    steps:
    - uses: actions/checkout@v2
    - name: Set up Python
      uses: actions/setup-python@v2
      with:
        python-version: '3.8'

    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install -r requirements.txt        

    - name: Run tests
      run: |
        # Add commands to run your tests here
        pytest        

    - name: Deploy to Heroku
      if: github.ref == 'refs/heads/main' && github.event_name == 'push'
      uses: akhileshns/heroku-deploy@v3.12.12
      with:
        heroku_api_key: ${{ secrets.HEROKU_API_KEY }}
        heroku_app_name: "your-app-name" # Replace with your Heroku app's name
        heroku_email: "your-email@example.com" # Replace with your Heroku account email

Before the first run, add HEROKU_API_KEY as a repository secret (Settings → Secrets and variables → Actions) so the deploy step can authenticate.

Model Monitoring and Evaluation with Evidently AI

Setting Up Evidently AI

  • Installation

    pip install evidently
    

    This tool helps monitor model performance and detect data drift in your production environment.

Integration Example

  • Monitoring Dashboard: Create dashboards and reports to visualize model performance metrics and data quality; a minimal drift-report sketch is shown below.
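
    A minimal data-drift report, assuming Evidently's Report / DataDriftPreset API and illustrative file paths:

    import pandas as pd
    from evidently.report import Report
    from evidently.metric_preset import DataDriftPreset

    # reference_df: data the model was trained on; current_df: recent production data
    reference_df = pd.read_csv("data/train.csv")
    current_df = pd.read_csv("data/production_sample.csv")

    report = Report(metrics=[DataDriftPreset()])
    report.run(reference_data=reference_df, current_data=current_df)
    report.save_html("drift_report.html")  # open in a browser or serve from your monitoring job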

Conclusion

This guide provides a comprehensive overview of building and maintaining an MLOps pipeline with a focus on practical implementation and troubleshooting common issues.

Tool-Specific Gotchas

When working with tools in MLOps, it’s important to be aware of common pitfalls and how to avoid them. This section highlights specific gotchas for several tools commonly found in MLOps stacks and provides tips for troubleshooting.

Apache Spark

  • Memory Management: Spark applications can hit OutOfMemory errors if executors are under-provisioned. Tune spark.executor.memory and spark.memory.fraction, and watch the storage/execution memory split in the Spark UI.
  • Shuffle Operations: Shuffle operations can be costly. Minimize them by optimizing your transformations or increasing spark.sql.shuffle.partitions when necessary.

TensorFlow

  • GPU Memory Issues: Running TensorFlow on GPUs can lead to memory overflow. Call tf.config.experimental.set_memory_growth(gpu, True) for each GPU device so TensorFlow allocates memory as needed instead of reserving it all upfront (see the snippet after this list).
  • Version Compatibility: TensorFlow’s rapid development pace means new versions might not be fully compatible with old code. Always check the upgrade guides.
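
A short sketch of the memory-growth setting mentioned above:

    import tensorflow as tf

    # Enable memory growth on every visible GPU so memory is allocated on demand
    for gpu in tf.config.list_physical_devices('GPU'):
        tf.config.experimental.set_memory_growth(gpu, True)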

Scikit-Learn

  • Data Scaling: Many Scikit-Learn models are sensitive to feature scale, so scale your input features (e.g., with StandardScaler) before training.
  • Model Persistence: When saving and loading models, use joblib instead of pickle; it is more efficient for objects that hold large NumPy arrays. A combined sketch of both points follows this list.
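
A combined sketch, using an illustrative pipeline and synthetic data:

    import joblib
    from sklearn.datasets import make_classification
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    # Synthetic data standing in for your own training set
    X_train, y_train = make_classification(n_samples=200, n_features=5, random_state=0)

    # Bundle scaling with the model so the same transform is applied at inference time
    pipeline = make_pipeline(StandardScaler(), SVC())
    pipeline.fit(X_train, y_train)

    # Persist with joblib, which handles large NumPy arrays efficiently
    joblib.dump(pipeline, "model.joblib")
    model = joblib.load("model.joblib")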

These insights should help you avoid common issues and make better use of these powerful tools.