
Implementing MLOps

Operationalize machine learning models from experimentation to production deployment and monitoring.

When to Use

Use this skill when:

  • Designing MLOps infrastructure for production ML systems
  • Selecting experiment tracking platforms (MLflow, Weights & Biases, Neptune)
  • Implementing feature stores for online/offline feature serving
  • Choosing model serving solutions (Seldon Core, KServe, BentoML, TorchServe)
  • Building ML pipelines for training, evaluation, and deployment
  • Setting up model monitoring and drift detection
  • Establishing model governance and compliance frameworks
  • Optimizing ML inference costs and performance
  • Migrating from notebooks to production ML systems
  • Implementing continuous training and automated retraining

Key Features

1. Experiment Tracking and Model Registry

Platform Options:

  • MLflow: Open-source standard, framework-agnostic, can be self-hosted on any cloud
  • Weights & Biases: Advanced visualization, team collaboration, integrated hyperparameter optimization
  • Neptune.ai: Enterprise features (RBAC, audit logs, compliance)

Model Registry Components:

  • Model artifacts with version control and stage management
  • Training metrics, hyperparameters, and dataset versions
  • Model cards for documentation and limitations
  • Feature schema (input/output signatures)
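
These components map naturally onto MLflow's model registry. A minimal sketch, assuming a model was logged in a prior run; the registered-model name, the run-URI placeholder, and the "champion" alias are illustrative:

import mlflow
from mlflow.tracking import MlflowClient

# Register the model logged in a run as a new version
# (replace <run_id> with the run that logged the model; the name is illustrative)
result = mlflow.register_model("runs:/<run_id>/model", "fraud-detector")

client = MlflowClient()
# Attach an alias so serving code can always resolve the current production version
client.set_registered_model_alias("fraud-detector", "champion", result.version)
# Record documentation alongside the version (model card link, dataset version, etc.)
client.update_model_version(
    name="fraud-detector",
    version=result.version,
    description="Trained on transactions dataset v3; see model card for limitations.",
)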

Selection Criteria:

  • Open-source requirement → MLflow
  • Team collaboration critical → Weights & Biases
  • Enterprise compliance (RBAC, audits) → Neptune.ai

2. Feature Stores

Problem Addressed: Training/serving skew caused by inconsistent feature engineering between training pipelines and online inference

Feature Store Solution:

  • Online Store: Low-latency retrieval for real-time inference (Redis, DynamoDB)
  • Offline Store: Historical data for training and backtesting (Parquet, BigQuery)
  • Point-in-Time Correctness: No future data leakage during training
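
As a minimal sketch of how one set of feature definitions serves both paths, here is Feast usage assuming a repository with a driver_hourly_stats feature view and a driver_id entity (all names are illustrative):

import pandas as pd
from feast import FeatureStore

# Point at an existing Feast repo; feature view and entity names are illustrative
store = FeatureStore(repo_path=".")

# Offline: point-in-time correct training data joined against a labelled entity dataframe
entity_df = pd.DataFrame({
    "driver_id": [1001, 1002],
    "event_timestamp": pd.to_datetime(["2024-01-01", "2024-01-02"]),
})
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=["driver_hourly_stats:trips_today", "driver_hourly_stats:avg_rating"],
).to_df()

# Online: low-latency lookup of the same features at inference time
online_features = store.get_online_features(
    features=["driver_hourly_stats:trips_today", "driver_hourly_stats:avg_rating"],
    entity_rows=[{"driver_id": 1001}],
).to_dict()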

Platform Comparison:

  • Feast: Open-source, cloud-agnostic, most popular
  • Tecton: Managed service, production-grade, Feast-compatible API
  • SageMaker Feature Store: AWS ecosystem integration
  • Databricks Feature Store: Unity Catalog integration

3. Model Serving Patterns

Serving Options:

  • REST API: HTTP endpoint for synchronous predictions (<100ms latency)
  • gRPC: High-performance RPC for low-latency inference (<10ms)
  • Batch Inference: Process large datasets offline (minutes to hours)
  • Streaming Inference: Real-time predictions on streaming data (milliseconds)

Platform Selection:

  • Seldon Core: Kubernetes-native, advanced deployment strategies (canary, A/B testing)
  • KServe: CNCF standard, serverless scaling with Knative
  • BentoML: Python-first, simplest to get started
  • TorchServe: PyTorch-specific, production-grade
  • TensorFlow Serving: TensorFlow-specific, optimized

4. Deployment Strategies

Blue-Green Deployment: Two environments, instant traffic switch, instant rollback

Canary Deployment: Gradual rollout (5% → 10% → 25% → 50% → 100%)

Shadow Deployment: New model receives a copy of production traffic, but its predictions are not served to users (zero production risk)

A/B Testing: Split traffic between versions, measure business metrics

Multi-Armed Bandit: Epsilon-greedy exploration that continuously shifts traffic toward the better-performing variant
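
A minimal sketch of epsilon-greedy routing between two model variants; the variant names and the reward signal (e.g., conversion) are illustrative assumptions:

import random

class EpsilonGreedyRouter:
    """Routes requests between model variants, favoring the one with higher observed reward."""

    def __init__(self, variants, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = {v: 0 for v in variants}
        self.total_reward = {v: 0.0 for v in variants}

    def choose(self):
        # Explore with probability epsilon, otherwise exploit the best variant so far
        if random.random() < self.epsilon:
            return random.choice(list(self.counts))

        def mean_reward(v):
            # Unseen variants get +inf so each one is tried at least once
            return self.total_reward[v] / self.counts[v] if self.counts[v] else float("inf")

        return max(self.counts, key=mean_reward)

    def record(self, variant, reward):
        self.counts[variant] += 1
        self.total_reward[variant] += reward

router = EpsilonGreedyRouter(["model-v1", "model-v2"])
variant = router.choose()           # pick which model serves this request
router.record(variant, reward=1.0)  # feed back the observed business metric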

5. ML Pipeline Orchestration

Pipeline Stages:

  1. Data Validation (Great Expectations)
  2. Feature Engineering
  3. Data Splitting (train/validation/test)
  4. Model Training with hyperparameter tuning
  5. Model Evaluation (accuracy, fairness, explainability)
  6. Model Registration
  7. Deployment

Platform Comparison:

  • Kubeflow Pipelines: ML-native, Kubernetes-based
  • Apache Airflow: Mature, general-purpose, large ecosystem
  • Metaflow: Developed at Netflix, data-scientist-friendly Python API
  • Prefect: Modern, dynamic workflows, better error handling than Airflow
  • Dagster: Asset-based thinking, strong testing features
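
As a sketch of how the stages above map onto an orchestrator, here is a minimal Prefect 2.x flow; the task bodies are stubs and the metric threshold is illustrative:

from prefect import flow, task

@task
def validate_data(raw_path: str) -> str:
    # e.g., run a Great Expectations suite here; return path to validated data
    return raw_path

@task
def train_model(data_path: str) -> dict:
    # train and return metrics plus a model artifact reference
    return {"accuracy": 0.92, "model_uri": "runs:/example/model"}

@task
def evaluate_and_register(result: dict, threshold: float = 0.9) -> bool:
    # register the model only if it clears the evaluation gate
    return result["accuracy"] >= threshold

@flow
def training_pipeline(raw_path: str = "data/raw.parquet"):
    validated = validate_data(raw_path)
    result = train_model(validated)
    if evaluate_and_register(result):
        print("Model registered and ready for deployment")

if __name__ == "__main__":
    training_pipeline()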

6. Model Monitoring and Drift Detection

Data Drift Detection:

  • Kolmogorov-Smirnov (KS) Test: Compares the reference (training) distribution against live data
  • Population Stability Index (PSI): Measure distribution shift
  • Chi-Square Test: For categorical features
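
A minimal sketch of the two most common checks, using scipy for the KS test and a hand-rolled PSI; the bin count and the 0.2 alert threshold are common conventions, not requirements:

import numpy as np
from scipy.stats import ks_2samp

def population_stability_index(expected, actual, bins=10):
    """PSI between a reference (training) and current (production) feature distribution."""
    expected, actual = np.asarray(expected), np.asarray(actual)
    cuts = np.quantile(expected, np.linspace(0, 1, bins + 1))
    cuts[0] = min(cuts[0], actual.min()) - 1e-9
    cuts[-1] = max(cuts[-1], actual.max()) + 1e-9
    expected_pct = np.histogram(expected, cuts)[0] / len(expected)
    actual_pct = np.histogram(actual, cuts)[0] / len(actual)
    # Avoid log(0) for empty bins
    expected_pct = np.clip(expected_pct, 1e-6, None)
    actual_pct = np.clip(actual_pct, 1e-6, None)
    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))

reference = np.random.normal(0, 1, 10_000)     # e.g., training feature values
production = np.random.normal(0.3, 1, 10_000)  # e.g., last week's live values

ks_stat, p_value = ks_2samp(reference, production)
psi = population_stability_index(reference, production)
print(f"KS p-value: {p_value:.4f}, PSI: {psi:.3f}")  # PSI > 0.2 is a common drift alert threshold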

Model Drift Detection:

  • Accuracy against ground truth once delayed labels arrive
  • Prediction distribution changes
  • Calibration drift (predicted probabilities vs actual outcomes)

Performance Monitoring:

  • Latency: P50, P95, P99 inference time
  • Throughput: Predictions per second
  • Error Rate: Failed predictions / total requests
  • Resource Utilization: CPU, memory, GPU usage

Tools:

  • Evidently AI: Data drift, model drift reports
  • Prometheus + Grafana: Performance metrics
  • Arize AI: ML observability platform
  • Fiddler: Model monitoring and explainability
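
For the performance metrics above, a minimal prometheus_client sketch instrumenting a prediction function (metric names and port are illustrative); Prometheus scrapes /metrics and Grafana plots the histogram quantiles:

import time
from prometheus_client import Counter, Histogram, start_http_server

PREDICTION_LATENCY = Histogram("model_prediction_latency_seconds", "Inference latency")
PREDICTION_ERRORS = Counter("model_prediction_errors_total", "Failed predictions")
PREDICTIONS = Counter("model_predictions_total", "Total predictions served")

def predict(model, features):
    start = time.perf_counter()
    try:
        result = model.predict([features])
        PREDICTIONS.inc()
        return result
    except Exception:
        PREDICTION_ERRORS.inc()
        raise
    finally:
        PREDICTION_LATENCY.observe(time.perf_counter() - start)

# Expose /metrics for Prometheus to scrape; P50/P95/P99 come from histogram quantiles
start_http_server(8000)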

7. Model Optimization Techniques

Quantization: Convert float32 weights to int8 (~4x smaller, typically 2-3x faster)
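
A minimal sketch of post-training dynamic quantization in PyTorch; the toy model stands in for a real one, and actual size/speed gains vary by architecture and hardware:

import torch
import torch.nn as nn

# Stand-in for a trained model; in practice load your real float32 model
model = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 2)).eval()

# Replace Linear layers with int8 dynamically quantized equivalents
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 256)
print(quantized(x))  # same interface, smaller weights, usually faster CPU inference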

Model Distillation: Train a small student model to mimic a large teacher model (2-10x smaller)

ONNX Conversion: Cross-framework compatibility, optimized inference (1.5-3x faster)
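
A sketch of exporting a PyTorch model to ONNX and running it with ONNX Runtime; the model and tensor shapes are illustrative:

import numpy as np
import torch
import torch.nn as nn
import onnxruntime as ort

model = nn.Sequential(nn.Linear(256, 2)).eval()  # stand-in for a trained model
dummy = torch.randn(1, 256)

# Export to the framework-neutral ONNX format
torch.onnx.export(model, dummy, "model.onnx", input_names=["input"], output_names=["output"])

# Run the exported graph with ONNX Runtime's optimized execution providers
session = ort.InferenceSession("model.onnx")
outputs = session.run(None, {"input": np.random.randn(1, 256).astype(np.float32)})
print(outputs[0])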

Model Pruning: Remove less important weights (2-10x smaller)

8. LLMOps Patterns

LLM Fine-Tuning: LoRA and QLoRA for parameter-efficient fine-tuning
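
A minimal sketch of attaching LoRA adapters with the peft library; the base model name and target_modules are illustrative and depend on the architecture:

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Base model name is illustrative; target_modules vary by architecture
base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
lora_config = LoraConfig(
    r=8,             # rank of the low-rank update matrices
    lora_alpha=16,   # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the base parameters
# Train with your usual Trainer / training loop; only the adapter weights are updated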

Prompt Versioning: Version control for prompts, A/B testing

RAG System Monitoring: Retrieval quality, generation quality, hallucination detection

LLM Inference Optimization:

  • vLLM: High-throughput LLM serving
  • TensorRT-LLM: NVIDIA-optimized
  • Text Generation Inference (TGI): Hugging Face serving
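
A minimal vLLM offline-inference sketch (the model name is illustrative); the same engine also powers vLLM's OpenAI-compatible API server:

from vllm import LLM, SamplingParams

# vLLM uses continuous batching and PagedAttention for high throughput
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Summarize the purpose of an ML feature store."], params)
print(outputs[0].outputs[0].text)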

9. Model Governance and Compliance

Model Cards: Documentation of model purpose, limitations, biases, ethical considerations

Bias and Fairness Detection:

  • Tools: Fairlearn, AI Fairness 360 (IBM)
  • Metrics: Demographic parity, equalized odds, calibration
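
A sketch of computing these metrics with Fairlearn; the labels, predictions, and sensitive-feature groups below are illustrative:

import numpy as np
from fairlearn.metrics import demographic_parity_difference, equalized_odds_difference

# y_true/y_pred would come from a held-out evaluation set; these arrays are illustrative
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])
group = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])  # e.g., a protected attribute

dpd = demographic_parity_difference(y_true, y_pred, sensitive_features=group)
eod = equalized_odds_difference(y_true, y_pred, sensitive_features=group)
print(f"Demographic parity difference: {dpd:.3f}, equalized odds difference: {eod:.3f}")
# Values near 0 indicate parity; teams typically set a per-use-case tolerance (e.g., < 0.1)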

Regulatory Compliance:

  • EU AI Act: High-risk AI systems documentation
  • Model Risk Management (SR 11-7): Banking requirements
  • GDPR: Right to explanation
  • HIPAA: Healthcare data privacy

Quick Start

Set Up an MLflow Server

# Install MLflow
pip install mlflow

# Start local tracking server
mlflow server --host 0.0.0.0 --port 5000

Track Experiments

import mlflow
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("my-experiment")

# Any dataset works; iris keeps the example self-contained
X_train, X_test, y_train, y_test = train_test_split(*load_iris(return_X_y=True), random_state=42)

with mlflow.start_run():
    model = RandomForestClassifier(n_estimators=100)
    model.fit(X_train, y_train)

    accuracy = model.score(X_test, y_test)
    mlflow.log_param("n_estimators", 100)
    mlflow.log_metric("accuracy", accuracy)
    mlflow.sklearn.log_model(model, "model")

Deploy with BentoML

import bentoml
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Train and save the model to the local BentoML model store (e.g., train.py)
X_train, y_train = load_iris(return_X_y=True)
model = RandomForestClassifier()
model.fit(X_train, y_train)
bentoml.sklearn.save_model("my_model", model)

# Define the service (e.g., service.py) with the BentoML 1.2+ class-based API
import bentoml

@bentoml.service
class MyService:
    def __init__(self):
        self.model = bentoml.sklearn.load_model("my_model:latest")

    @bentoml.api
    def predict(self, input_data: list[list[float]]) -> list[int]:
        return self.model.predict(input_data).tolist()

Tool Recommendations

Production-Ready Stack (Startup)

  • Experiment Tracking: MLflow (free, self-hosted)
  • Feature Store: Skip initially → Feast when needed
  • Model Serving: BentoML (easy) or cloud functions
  • Orchestration: Prefect or cron jobs
  • Monitoring: Basic logging + Prometheus

Production-Ready Stack (Growth Company)

  • Experiment Tracking: Weights & Biases or MLflow
  • Feature Store: Feast (open-source, production-ready)
  • Model Serving: BentoML or KServe (Kubernetes-based)
  • Orchestration: Kubeflow Pipelines or Airflow
  • Monitoring: Evidently + Prometheus + Grafana

Production-Ready Stack (Enterprise)

  • Experiment Tracking: MLflow (self-hosted) or Neptune.ai (compliance)
  • Feature Store: Tecton (managed) or Feast (self-hosted)
  • Model Serving: Seldon Core (advanced) or KServe
  • Orchestration: Kubeflow Pipelines or Airflow
  • Monitoring: Evidently + Prometheus + Grafana + PagerDuty

Related Skills

  • ai-data-engineering: Feature engineering, ML algorithms, data preparation
  • kubernetes-operations: K8s cluster management, GPU scheduling for ML workloads
  • implementing-observability: Monitoring, alerting, distributed tracing for ML systems
  • designing-distributed-systems: Scalability patterns for ML workloads
  • building-ai-chat: LLM-powered applications consuming ML models

Best Practices

  1. Version Everything: Code (Git), data (DVC), models (semantic versioning), features
  2. Automate Testing: Unit tests, integration tests, model validation
  3. Monitor Continuously: Data drift, model drift, performance
  4. Start Simple: Begin with MLflow + basic serving, add complexity as needed
  5. Point-in-Time Correctness: Use feature stores to avoid training/serving skew
  6. Deployment Strategies: Use canary for medium-risk, shadow for high-risk models
  7. Governance: Model cards, audit trails, compliance tracking
  8. Cost Optimization: Quantization, spot instances, autoscaling

References

  • Full skill documentation: /skills/implementing-mlops/SKILL.md
  • Experiment tracking guide: /skills/implementing-mlops/references/experiment-tracking.md
  • Model registry patterns: /skills/implementing-mlops/references/model-registry.md
  • Feature stores: /skills/implementing-mlops/references/feature-stores.md
  • Model serving: /skills/implementing-mlops/references/model-serving.md
  • Deployment strategies: /skills/implementing-mlops/references/deployment-strategies.md
  • ML pipelines: /skills/implementing-mlops/references/ml-pipelines.md
  • Model monitoring: /skills/implementing-mlops/references/model-monitoring.md
  • LLMOps patterns: /skills/implementing-mlops/references/llmops-patterns.md