
MLA 014 Machine Learning Hosting and Serverless Deployment

Jan 17, 2021 (updated Mar 04, 2026)

Builders can scale ML from simple API calls to full MLOps pipelines using SST on AWS, with Aurora pgvector for search and Spot instances for 60 to 90 percent cost savings on training. When AWS's native limits are reached (no GPU on Lambda or Fargate), external platforms such as Modal or GCP Cloud Run offer serverless GPU options for real-time inference.

Show Notes

Core Infrastructure

SST uses Pulumi to bridge high-level web components (API, Database) with low-level AWS resources (SageMaker, GPU clusters). The framework enables infrastructure-as-code in TypeScript, allowing developers to manage entire ML lifecycles within a single configuration.

Level 1-2: Foundational Models and Edge Inference

  • AWS Bedrock: Managed gateway for models including Claude 4.5, Llama 4, and Nova. It provides IAM security, VPC isolation, and integrated billing.
  • Knowledge Bases: Automates RAG pipelines by chunking S3 documents and storing embeddings in Aurora pgvector.
  • Cloudflare Workers AI: Runs open-source models (Llama, Mistral, Flux) on edge GPUs. Pricing uses "Neurons" units, measuring compute per request rather than tokens.

Level 3-4: Cost-Effective CPU and Batch Processing

  • Lambda Inference: Run ONNX-formatted models on AWS Lambda to keep costs low. SnapStart cuts model-loading cold starts from about 16 seconds to about 1.6 seconds.
  • Vector Search: The SST Vector component manages semantic search within existing Aurora PostgreSQL databases using pgvector, which matches dedicated vector databases for workloads under roughly 10 million vectors.
  • SST Task: Runs Fargate containers for CPU-bound ETL and data preprocessing.
  • AWS Batch: Orchestrates GPU training on EC2. Using Spot instances reduces costs by 60 to 90 percent, with checkpointing protecting against instance reclamation.

Level 5: Real-Time GPU Inference

  • AWS Options: SageMaker Real-Time endpoints have supported scale-to-zero since late 2024. SageMaker Async queues requests and exchanges large payloads through S3.
  • External Alternatives:
    • GCP Cloud Run: Offers serverless L4 and Blackwell GPUs with per-second billing.
    • Modal: Python-native serverless GPU platform with 2 to 4 second cold starts.
    • Groq: Uses LPU hardware for LLM inference, reaching 1300 tokens per second.
    • RunPod: Provides the lowest raw GPU pricing and FlashBoot for fast starts.

Level 6-7: MLOps and Mature Production

  • SageMaker Platform: Includes Studio for IDE work, JumpStart for one-click model deployment, and Model Registry for version tracking.
  • Monitoring: Use Arize Phoenix or Evidently AI to detect data and concept drift. Log all predictions to S3 for weekly distribution analysis.
  • Hardware Optimization: AWS Inferentia and Trainium chips offer up to 70 percent lower inference costs compared to GPUs. The transition becomes viable when monthly inference spend exceeds roughly 10,000 dollars.
  • Self-Hosting: API calls are cheaper until volume reaches 30 million tokens daily. For self-hosting, use vLLM for high-throughput PagedAttention.

Transcript

The complete ML deployment guide for SST v4 builders on AWS

Your SST v4 stack already gives you more ML infrastructure than you think - and where it falls short, the answer is rarely "build more on AWS." This guide maps every ML deployment option from your existing sst.config.ts outward to external GPU platforms, organized so you can make the cheapest, fastest decision for each use case. The key insight: most startups should call an API for LLM tasks, use Cloudflare Workers AI or Lambda for lightweight inference, and only reach for GPU infrastructure when they've proven the need. AWS has the broadest service catalog but critical gaps - no GPU on Lambda, no GPU on Fargate - that make external platforms essential for serverless GPU workloads.


1. What SST v4 gives you natively for ML

SST v4 doesn't have dedicated ML components, but its existing primitives cover more ground than you'd expect. The key is knowing what fits where.

Cloudflare Workers AI via sst.cloudflare.Worker

Despite some buzz, SST has no first-class WorkersAI component. What exists is the sst.cloudflare.Worker component, which deploys a standard Cloudflare Worker - and inside that Worker, you access Workers AI through the env.AI binding. This is Cloudflare's native pattern, not an SST abstraction:

// sst.config.ts
const worker = new sst.cloudflare.Worker("AIWorker", {
  handler: "./ai-worker.ts",
  url: true,
});

// ai-worker.ts
export default {
  async fetch(request, env) {
    const response = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
      messages: [{ role: "user", content: "Summarize this document" }]
    });
    return Response.json(response);
  }
};

Workers AI runs 50+ models on Cloudflare's edge network across 300+ locations. The model catalog includes Llama 4 Scout 17B, Llama 3.3 70B, DeepSeek R1 distilled variants, Mistral Small 3.1 24B, Qwen 2.5 Coder 32B, OpenAI GPT-OSS (120B and 20B), FLUX.2 for image generation, Whisper for speech-to-text, and multiple embedding models (BGE, EmbeddingGemma). Pricing is $0.011 per 1,000 Neurons (a unit measuring GPU compute per request), with a generous free tier of 10,000 Neurons/day. The Workers Paid plan costs $5/month for usage beyond free allocation. For a startup adding AI features to an existing app, Workers AI is often the fastest path to production - you get serverless GPU inference with no cold start management, edge-distributed latency, and an OpenAI-compatible API format.
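
To make the Neurons pricing concrete, here is a minimal sketch of the daily arithmetic. It assumes the free allocation is subtracted each day before billing and uses only the rates quoted above; the function name and the sample usage figures are hypothetical, and the $5/month plan fee is left out.

```typescript
// Rates as quoted in the text above; check current Cloudflare pricing before relying on them.
const FREE_NEURONS_PER_DAY = 10_000;
const USD_PER_1000_NEURONS = 0.011;

// Estimated usage charge for a single day.
// Assumption: the free allocation is deducted daily, and the Workers Paid plan fee is billed separately.
function workersAiDailyCostUsd(neuronsUsed: number): number {
  const billableNeurons = Math.max(0, neuronsUsed - FREE_NEURONS_PER_DAY);
  return (billableNeurons / 1000) * USD_PER_1000_NEURONS;
}

console.log(workersAiDailyCostUsd(8_000));   // stays inside the free tier
console.log(workersAiDailyCostUsd(110_000)); // 100,000 billable Neurons
```

A day that stays under the free allocation costs nothing, so low-traffic features can run on the free tier indefinitely.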

Limitations matter, though. You cannot upload custom models (only Cloudflare's curated catalog), there's no fine-tuning, and large models (70B+) aren't practical on edge GPUs. Workers AI is for calling pre-deployed open-source models, not for custom ML workloads.

sst.aws.Task for CPU-bound ML work

The Task component spins up Fargate containers for async, long-running jobs. It supports up to 16 vCPU and 120 GB memory, EFS mounts for shared model files, and arm64 architecture for cost savings. At minimum configuration (0.25 vCPU + 0.5 GB), it costs roughly $0.02/hour, and unlike Lambda it has no 15-minute timeout, which makes it the natural home for jobs that run longer than a single Lambda invocation allows.

ML workloads that fit here: data preprocessing/ETL pipelines, scikit-learn or XGBoost inference, feature engineering, batch scoring with CPU-based models, model artifact packaging. In sst dev, tasks run locally via a configurable dev.command, making the development loop tight.

The hard limit: no GPU. Fargate does not support GPU instances, full stop. This constraint shapes everything downstream.

Other SST components relevant to ML

| Component | What it does for ML |
| --- | --- |
| sst.aws.Vector | pgvector-backed vector database on Aurora Serverless v2. Up to 2,000 dimensions. Built-in SDK with put, query, remove + metadata filtering. Your RAG starting point. |
| sst.aws.StepFunctions (beta) | Orchestrate multi-step ML pipelines across Lambda, ECS Task, Batch, CodeBuild. JSONata transforms. |
| sst.aws.Cluster + Service | ECS Fargate services for always-on inference containers. Supports Fargate Spot for dev environments. Still no GPU. |
| sst.aws.Function | Lambda functions (Python, Node, Rust, Go). 15-min timeout, 10 GB memory. Lightweight CPU inference only. |
| sst.aws.Efs | Elastic File System - share model files across Tasks and Services. |
| sst.aws.Bucket | S3 for model artifacts, training data, batch results. |
| sst.aws.Queue | SQS for async inference request queuing. |

What SST does NOT have native components for: SageMaker, Bedrock, AWS Batch, or any GPU compute. For those, you drop down to Pulumi.


2. What Pulumi unlocks inside your sst.config.ts

Since SST v4 runs on Pulumi (AWS provider v7.21.0), every @pulumi/aws resource is available directly in your config file. This means you can provision GPU infrastructure, SageMaker endpoints, Bedrock agents, and complex ML pipelines without leaving your SST project.

AWS Batch with GPU and Spot instances

AWS Batch is a managed job scheduler with zero surcharge - you pay only for the underlying EC2 instances. This makes it the cheapest path to GPU compute on AWS for training and batch inference.

import * as aws from "@pulumi/aws";

const gpuEnv = new aws.batch.ComputeEnvironment("mlGpu", {
  type: "MANAGED",
  computeResources: {
    type: "SPOT",  // 60-90% savings vs on-demand
    instanceTypes: ["g5.2xlarge"],
    maxVcpus: 16,
    minVcpus: 0,  // scale to zero when idle
    allocationStrategy: "SPOT_CAPACITY_OPTIMIZED",
    ec2Configurations: [{ imageType: "ECS_AL2_NVIDIA" }],
    // ... security groups, subnets, IAM
  },
});

Spot instances save 60–90% off on-demand GPU pricing. A g5.xlarge (A10G GPU) runs roughly $1.00/hr on-demand but $0.30–0.40/hr on Spot. For fault-tolerant training with checkpointing every 15 minutes, Spot is a no-brainer. Supported GPU instance families include g4dn (T4, ~$0.53/hr), g5 (A10G, ~$1.01/hr), g6 (latest gen), p3 (V100, ~$3.06/hr), p4d (A100, ~$21.96/hr), and p5 (H100, ~$98.32/hr). Set minVcpus: 0 and instances terminate when no jobs are running - true zero idle cost.
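
A minimal sketch of the Spot trade-off, under a deliberately pessimistic assumption: every interruption throws away one full checkpoint interval of work. The interface, the function name, and the sample job (10 hours, 2 interruptions, $0.35/hr Spot) are hypothetical; the hourly rates fall in the ranges quoted above.

```typescript
interface SpotJob {
  computeHours: number;      // useful training time needed
  onDemandPerHour: number;   // USD
  spotPerHour: number;       // USD
  interruptions: number;     // expected Spot reclamations during the job (hypothetical input)
  checkpointMinutes: number; // checkpoint interval
}

// Compares on-demand cost with Spot cost including re-done work.
// Assumption: each interruption loses one full checkpoint interval (worst case).
function estimateSpotJob(job: SpotJob) {
  const reworkHours = job.interruptions * (job.checkpointMinutes / 60);
  const spotCostUsd = (job.computeHours + reworkHours) * job.spotPerHour;
  const onDemandCostUsd = job.computeHours * job.onDemandPerHour;
  const savingsPercent = (1 - spotCostUsd / onDemandCostUsd) * 100;
  return { reworkHours, spotCostUsd, onDemandCostUsd, savingsPercent };
}

const estimate = estimateSpotJob({
  computeHours: 10,
  onDemandPerHour: 1.0,
  spotPerHour: 0.35,
  interruptions: 2,
  checkpointMinutes: 15,
});
console.log(estimate);
```

Even with the lost work counted, the sample job keeps most of the Spot discount, which is why a short checkpoint interval matters more than avoiding interruptions altogether.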

SageMaker endpoints and training

Pulumi's aws.sagemaker.* namespace covers the full SageMaker lifecycle: Model, EndpointConfiguration, Endpoint, NotebookInstance, Domain, FeatureGroup, Pipeline, and MonitoringSchedule. A real-time inference endpoint provisioned in your SST config looks like:

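// Assumes sagemakerRole (an aws.iam.Role that SageMaker can assume) is defined elsewhere in the config.
const model = new aws.sagemaker.Model("myModel", {
  executionRoleArn: sagemakerRole.arn,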
  primaryContainer: {
    image: "763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-inference:2.0-gpu",
    modelDataUrl: "s3://my-bucket/model.tar.gz",
  },
});
const config = new aws.sagemaker.EndpointConfiguration("myConfig", {
  productionVariants: [{
    instanceType: "ml.g4dn.xlarge",
    initialInstanceCount: 1,
    modelName: model.name,
    variantName: "AllTraffic",
  }],
});
const endpoint = new aws.sagemaker.Endpoint("myEndpoint", {
  endpointConfigName: config.name,
});

Bedrock agents and knowledge bases via Pulumi

The aws.bedrock.* namespace includes 25+ resources for provisioning Bedrock infrastructure as code: AgentAgent, AgentKnowledgeBase, AgentDataSource, CustomModel, Guardrail, ProvisionedModelThroughput, and more. You can wire up a complete RAG pipeline - S3 data source → embedding → Aurora pgvector store → Bedrock Knowledge Base → Agent - entirely within sst.config.ts. Community examples demonstrate building full Bedrock Knowledge Bases with SST using raw Pulumi resources.

Other Pulumi ML resources worth knowing: aws.ecs.TaskDefinition with GPU resource requirements (EC2 launch type only), aws.sfn.StateMachine for Step Functions ML orchestration (supports native SageMaker CreateTrainingJob, Batch SubmitJob, Bedrock InvokeModel actions), and aws.bedrockfoundation.getModels to query available foundation models dynamically.


3. The full AWS ML and MLOps landscape

SageMaker: the 15 tools and when each matters

SageMaker is AWS's kitchen-sink ML platform. Here's the honest startup-relevance breakdown:

Use immediately:

  • Studio - Web IDE with JupyterLab. Free tier: 250 hours of ml.t3.medium. Set auto-shutdown lifecycle configs to avoid surprise bills.
  • JumpStart - 400+ pre-trained models (Llama, Stable Diffusion, BERT variants) deployable with one click. Saves weeks of setup. No additional charge beyond compute.
  • Model Registry - Free. Version models, track metrics, approval workflows. Essential once you're iterating.

Use when you have production models:

  • Training Jobs - Managed training with per-second billing. ml.p3.2xlarge at ~$3.80/hr. Managed Spot Training saves up to 90%. Free tier: 50 hours of m5.xlarge.
  • Endpoints - Four flavors, each for a different traffic pattern (see table below).
  • Model Monitor - Detects data/model drift in production. 30 free hours of built-in rules.
  • Pipelines - ML-specific CI/CD as DAGs. No charge for the orchestration itself, only underlying compute.

Use when scale demands it:

  • Processing Jobs, Feature Store, Experiments - valuable at scale, overkill for early-stage single-model projects.

Skip unless specific need:

  • Clarify (bias detection) - important for regulated industries, skip otherwise.
  • Ground Truth (data labeling) - only if you need labeled training data at scale.
  • Canvas (no-code ML) - $1.90/hr. Useful for non-technical co-founders to prototype.

SageMaker inference: four modes compared

| Mode | How it works | Cold start | Max memory | GPU? | Best for |
| --- | --- | --- | --- | --- | --- |
| Real-time | Always-on instances. Since Nov 2024, can scale to zero. | None (warm) | Unlimited | Yes | Consistent traffic, sub-second latency |
| Serverless | Auto-provisions on request, scales to zero | 1–3 seconds | 6 GB | No | Intermittent traffic, cost-sensitive |
| Async | Queue-based (S3 in/out), scales to zero | Variable | Unlimited | Yes | Large payloads, long inference |
| Batch Transform | Process entire dataset, then shut down | N/A | Unlimited | Yes | Offline scoring, scheduled predictions |

SageMaker Serverless Inference deserves special attention for startups. It charges ~$0.00008/second at 4 GB memory plus $0.016/GB data processed. At 10M invocations × 100ms × 2 GB, that's roughly $40/month. The crossover point where an always-on ml.m5.xlarge (~$196/month) becomes cheaper is around 800,000 requests/month at 500ms per inference. Key limitations: CPU-only, 6 GB max memory, 4 MB max payload, ~60-second timeout, no VPC support, no Model Monitor. Models above ~3–4 GB compressed won't fit. Concurrency caps at 200 per region across all serverless endpoints.
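
The compute portion of that estimate can be reproduced with a few lines. This is a rough sketch that assumes the per-second price scales linearly with configured memory (so 2 GB costs half the quoted 4 GB rate) and ignores the per-GB data processing charge; the function name is hypothetical.

```typescript
// Rate quoted above for a 4 GB serverless endpoint.
const USD_PER_SECOND_AT_4GB = 0.00008;

// Monthly compute charge only.
// Assumptions: price scales linearly with memory; data processing fees are excluded.
function serverlessMonthlyComputeUsd(
  invocations: number,
  secondsPerInvocation: number,
  memoryGb: number,
): number {
  const usdPerSecond = USD_PER_SECOND_AT_4GB * (memoryGb / 4);
  return invocations * secondsPerInvocation * usdPerSecond;
}

// 10M invocations at 100 ms each on a 2 GB endpoint.
console.log(serverlessMonthlyComputeUsd(10_000_000, 0.1, 2));
```

Plugging in your own latency and memory settings is the quickest way to see whether an always-on instance would be cheaper for your traffic.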

Lambda for ML: useful but fundamentally limited

AWS Lambda does not support GPUs, and there is no sign that it will. This is an architectural constraint of Firecracker (Lambda's micro-VM technology), which was explicitly designed without hardware accelerator passthrough. No re:Invent announcement has changed this.

What Lambda can do for ML inference: run ONNX Runtime models (17% faster than PyTorch for single inference), scikit-learn, LightGBM, XGBoost, and small quantized models (e.g., 4-bit GGUF models under 10 GB). Current limits are 10 GB memory, 10 GB container image, 10 GB /tmp storage, 15-minute timeout. One practitioner reported cutting inference costs from $36–171/month (SageMaker) to $3–25/month with Lambda + ONNX for a side project. Lambda SnapStart (available for Python) reduces cold starts from ~16 seconds to ~1.6 seconds for ML model loading. Best memory allocation for ML is typically 2,048 MB - doubling memory halves execution time until a plateau around 4,096 MB.

Fargate GPU: still missing, and it matters

AWS Fargate does not support GPU as of early 2026. The GitHub roadmap issue (#88) has been open for years, listed as "Work in Progress." AWS's own FAQ explicitly states: "Use Amazon EC2 for GPU workloads, which are not supported on AWS Fargate today." Some speculative sources mention enhanced GPU support coming in 2026, but no official GA announcement exists. Treat as unavailable for production planning.

The workaround is ECS on EC2 with GPU instances: use the ECS GPU-optimized AMI, set ECS_ENABLE_GPU_SUPPORT=true, and specify GPU resource requirements in your task definition. This requires managing EC2 capacity providers instead of Fargate's managed scaling - a meaningful operational burden for a solo developer. Instance options range from g4dn.xlarge at ~$0.53/hr (T4, good for inference) to p5.48xlarge at ~$98.32/hr (H100, frontier training only).
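
As an illustration of the GPU resource requirement, the fragment below builds a container definition of the kind an EC2-launch-type ECS task would use. The container name, image, and memory value are hypothetical placeholders; the `resourceRequirements` entry is the part that reserves a GPU.

```typescript
// Hypothetical container definition for an ECS task on the EC2 launch type.
const gpuContainerDefinition = {
  name: "inference",                    // hypothetical container name
  image: "my-registry/my-model:latest", // hypothetical image
  memory: 8192,                         // MiB, placeholder value
  essential: true,
  // Reserves one physical GPU on the container instance for this container.
  resourceRequirements: [{ type: "GPU", value: "1" }],
};

// aws.ecs.TaskDefinition accepts its container definitions as a JSON string.
const containerDefinitionsJson = JSON.stringify([gpuContainerDefinition]);
console.log(containerDefinitionsJson);
```

The task will only be placed on container instances that have a free GPU, so the capacity provider behind the cluster must launch GPU instance types.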

AWS Batch vs SageMaker Training

| Factor | AWS Batch | SageMaker Training |
| --- | --- | --- |
| Surcharge | None - raw EC2 pricing | ~15–20% premium |
| Container control | Full Docker, any image | Must conform to SageMaker container contract |
| ML features | Basic scheduling | Distributed training, HPO, debugging, profiling |
| Integration | Manual S3/CloudWatch | Native Experiments, Model Registry, Pipelines |
| Best for | Cost optimization, custom containers | Managed ML experience, integrated MLOps |

Amazon's own Search team uses AWS Batch for SageMaker Training jobs, increasing GPU utilization from 40% to 80%. For a startup doing occasional training runs, Batch + Spot is likely the cheapest path.

Bedrock: the API-first path to foundation models

AWS Bedrock provides ~100 serverless models including Claude Opus 4.5/Sonnet 4.5/Haiku 4.5, Llama 4, DeepSeek V3.2, Mistral Large 3, Amazon Nova 2, GPT-OSS-120B, Gemma 3, and models from Cohere, AI21, MiniMax, and Moonshot AI. Key per-million-token pricing:

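| Model | Input (per 1M tokens) | Output (per 1M tokens) |
| --- | --- | --- |
| Claude Sonnet 4.5 | ~$3.00 | ~$15.00 |
| Claude Haiku 4.5 | ~$1.00 | ~$5.00 |
| Llama 3.3 70B | ~$0.72 | ~$0.72 |
| DeepSeek V3.2 | $0.62 | $1.85 |
| GPT-OSS-120B | $0.15 | $0.62 |
| Mistral Large 3 | $0.50 | $1.50 |
| NVIDIA Nemotron Nano 2 | $0.06 | $0.23 |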

Bedrock's Flex tier offers a 50% discount for lower-priority processing. Key features beyond raw inference: Knowledge Bases (managed RAG with S3 → auto-chunking → embedding → vector store, supporting Aurora PostgreSQL, OpenSearch, Pinecone), Agents (autonomous tool-using agents with MCP support), Guardrails (content filtering, PII masking, hallucination detection with automated reasoning), and fine-tuning for select models (Llama, Titan, GPT-OSS).

Inferentia and Trainium: custom silicon for scale

AWS's custom chips offer up to 70% lower cost per inference (Inferentia2) and up to 50% lower training costs (Trainium) vs comparable GPUs. The latest Trainium3 (3nm, announced Dec 2025) delivers 2.52 PFLOPS FP8 per chip. The ecosystem requires the Neuron SDK and primarily supports PyTorch with transformer architectures.

For startups: skip these until inference costs exceed ~$10K/month. The Neuron SDK learning curve and limited model architecture support add friction that isn't worth it early on. Stick with GPUs for maximum flexibility during experimentation.

Step Functions for ML orchestration

Step Functions provides native integrations with SageMaker (CreateTrainingJob, CreateTransformJob, etc.), Batch (SubmitJob), Lambda, and ECS - all with .sync wait-for-completion support. Pricing is $0.025 per 1,000 state transitions for Standard workflows. Use Step Functions when you need to mix ML and non-ML services in a pipeline. Use SageMaker Pipelines when you're fully in the SageMaker ecosystem and want built-in lineage tracking and approval workflows.
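
A minimal sketch of what such a pipeline might look like in Amazon States Language, written as a TypeScript object. It chains a Batch job (using the `.sync` integration so the workflow waits for completion) to a Lambda step. The job queue, job definition, and function names are hypothetical, and the cost helper simply applies the Standard workflow rate quoted above.

```typescript
// Two-step workflow: run training on AWS Batch, then call a Lambda to register the result.
const trainingPipeline = {
  Comment: "Train on Batch, then register the model",
  StartAt: "Train",
  States: {
    Train: {
      Type: "Task",
      // The .sync suffix makes Step Functions wait until the Batch job finishes.
      Resource: "arn:aws:states:::batch:submitJob.sync",
      Parameters: {
        JobName: "train-model",
        JobQueue: "my-gpu-queue",      // hypothetical queue name
        JobDefinition: "my-train-job", // hypothetical job definition
      },
      Next: "Register",
    },
    Register: {
      Type: "Task",
      Resource: "arn:aws:states:::lambda:invoke",
      Parameters: {
        FunctionName: "register-model", // hypothetical function name
        "Payload.$": "$",               // pass the Batch result through as the payload
      },
      End: true,
    },
  },
};

// State machine definitions are supplied as a JSON string.
const pipelineDefinitionJson = JSON.stringify(trainingPipeline);

// Standard workflow pricing as quoted above: $0.025 per 1,000 state transitions.
function standardWorkflowCostUsd(stateTransitions: number): number {
  return (stateTransitions / 1000) * 0.025;
}

console.log(pipelineDefinitionJson.length > 0, standardWorkflowCostUsd(1_000_000));
```

At this rate the orchestration itself is rarely the expensive part of a pipeline; the compute inside each step is.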


4. GPU serverless and inference platforms outside AWS

This is where the landscape gets interesting for a startup. Several external platforms solve problems that AWS cannot - particularly serverless GPU with scale-to-zero.

The five platforms that matter most

Modal is the developer experience winner. A Python-first serverless GPU platform where you decorate functions with GPU requirements and Modal handles everything: containerization, provisioning, auto-scaling, scale-to-zero. Pricing is per-second - an H100 costs ~$3.95/hr, an A100-80GB ~$2.50/hr, a T4 ~$0.59/hr. Cold starts are typically 2–4 seconds thanks to a custom container runtime that's 100x faster than Docker. Modal offers $30/month free credits on the Starter plan and up to $25K in startup credits. Best for: Python-native ML teams who want zero infrastructure management, variable workloads, fine-tuning jobs, and inference APIs.

RunPod offers the cheapest raw GPU pricing. Their serverless offering has two modes: Flex Workers (scale-to-zero, pay only when processing) and Active Workers (always-on, 20–30% discount). An A100-80GB runs ~$1.64/hr on Community Cloud, ~$2.17/hr on Secure Cloud. FlashBoot achieves sub-200ms cold starts for 48% of requests. No egress fees. Best for: cost-sensitive GPU workloads at scale, image generation, custom model serving.

Replicate provides the broadest model library with the lowest setup friction. Thousands of community models accessible via a simple API, plus 100+ "official" models that are always warm with no cold starts. Pricing is per-prediction for official models or per-second of GPU time for custom models (T4 ~$0.81/hr, A100-80GB ~$5.04/hr). Replicate was acquired by Cloudflare in 2025. Best for: rapid prototyping, running popular open-source models without infrastructure.

Groq is the speed king for LLM inference. Their custom Language Processing Unit (LPU) - an all-SRAM, deterministic-execution chip - delivers 300–1,300+ tokens/second, which is 5–15x faster than GPU-based inference. Pricing is aggressively low: Llama 3.1 8B at $0.05/$0.08 per million input/output tokens, Llama 4 Scout at $0.11/$0.34, Llama 3.3 70B at $0.59/$0.79. Limitation: inference only (no training, no fine-tuning), limited to Groq-hosted models (~10 families). Best for: speed-critical LLM inference at the lowest per-token cost.

Together AI covers the widest open-source model selection with 200+ models, including Llama, Qwen, DeepSeek, and Mixtral variants. Pricing ranges from $0.06–$3.50 per million tokens. Their Batch API offers a 50% discount. Fine-tuning supported (LoRA and full). All endpoints are OpenAI-compatible, making provider switching trivial. Fireworks AI is a similar competitor with proprietary FireAttention engine claiming 4x lower latency than vLLM.

Hugging Face Inference Endpoints

Deploy any of 60,000+ models from the Hub on dedicated infrastructure. GPU pricing: T4 ~$0.50/hr, A10G ~$1.00/hr, A100 ~$4.50/hr, H100 ~$8.00/hr. Scale-to-zero supported. Note that TGI (Text Generation Inference) entered maintenance mode in December 2025 - HF now recommends vLLM or SGLang for new deployments.

When to use what: decision matrix

| Scenario | Best platform | Why |
| --- | --- | --- |
| Add AI to existing app, lightweight | Cloudflare Workers AI | Edge-distributed, free tier, zero infra |
| LLM API, speed critical | Groq | 5–15x faster, cheapest per-token |
| LLM API, model variety needed | Together AI | 200+ models, OpenAI-compatible |
| Serverless GPU, Python team | Modal | Best DX, per-second billing, $30 free/month |
| Cheapest raw GPU compute | RunPod | Lowest hourly rates, no egress fees |
| Run popular models instantly | Replicate | Always-warm official models, vast library |
| Deploy HuggingFace models | HF Inference Endpoints | Seamless Hub integration |
| Batch LLM processing | Together AI (50% off) or Fireworks (40% off) | Steep batch discounts |

5. AWS vs GCP vs Azure - what matters for your stack

GCP's GPU serverless advantage is real

GCP Cloud Run supports GPU, and it's been GA since June 2025. NVIDIA L4 GPUs (24GB) run at ~$0.67/hr with per-second billing, scale-to-zero, and ~5-second cold starts. NVIDIA RTX PRO 6000 Blackwell (96GB, supporting 70B+ models) is in preview. This is the serverless GPU container solution AWS hasn't shipped. For a startup that needs GPU inference with scale-to-zero economics and doesn't want to manage EC2 capacity providers, Cloud Run GPU is the most direct answer.

Beyond Cloud Run, GCP offers Vertex AI (more unified and easier to learn than SageMaker), TPUs (unique to GCP, best for large Transformer training in JAX/TensorFlow), and tight BigQuery integration for data-heavy ML workflows. Gemini 2.5 Pro at $1.25/$10 per million tokens is competitive with Claude and GPT-5.

Azure's OpenAI lock-in advantage

Azure's strongest ML play is Azure OpenAI Service: exclusive managed access to GPT-4, GPT-4o, GPT-5 family, DALL-E, and Whisper with enterprise security, VNet integration, and RBAC. No equivalent exists on AWS or GCP. Azure Container Apps now supports serverless GPUs (T4 and A100) in public preview. For enterprises already in the Microsoft ecosystem, Azure ML offers strong governance, Visual Designer, and hybrid edge deployments via Azure Arc.

Where AWS leads and where it doesn't

AWS leads on: breadth of services (200+ integrations), SageMaker maturity (since 2017, the most feature-complete ML platform), Bedrock's multi-provider model access, custom silicon (Inferentia/Trainium), and most importantly - your existing infrastructure is here. Moving data layers is expensive and risky.

AWS's key gaps: no GPU Fargate (the biggest pain point vs Cloud Run GPU), no native OpenAI model access (must use Bedrock alternatives or call OpenAI directly), steeper SageMaker learning curve vs Vertex AI, and SageMaker Serverless Inference is CPU-only.

The pragmatic verdict

Stay on AWS for your data layer, core infrastructure, and most ML services. Reach outside for specific capabilities:

  • LLM inference speed/cost → Groq, Together AI, Fireworks AI (OpenAI-compatible, near-zero switching cost)
  • Serverless GPU containers → GCP Cloud Run GPU or Modal
  • Rapid model prototyping → Replicate or Modal
  • Enterprise OpenAI access → Azure OpenAI Service
  • Edge AI features → Cloudflare Workers AI

Abstract your LLM calls behind a router (LiteLLM, Cloudflare AI Gateway) so you can switch providers without code changes.
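
The idea behind such a router can be sketched in a few lines. This is not how LiteLLM or Cloudflare AI Gateway are implemented; it is a hypothetical hand-rolled version that tries providers in order and falls back on failure, with stub providers standing in for real SDK calls.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Every provider is wrapped behind the same minimal interface.
interface ChatProvider {
  name: string;
  complete: (messages: ChatMessage[]) => Promise<string>;
}

// Tries each provider in order and returns the first successful completion.
async function routeChat(
  providers: ChatProvider[],
  messages: ChatMessage[],
): Promise<{ provider: string; text: string }> {
  const errors: string[] = [];
  for (const provider of providers) {
    try {
      const text = await provider.complete(messages);
      return { provider: provider.name, text };
    } catch (err) {
      errors.push(`${provider.name}: ${err instanceof Error ? err.message : String(err)}`);
    }
  }
  throw new Error(`All providers failed: ${errors.join("; ")}`);
}

// Stub providers (hypothetical): the first always fails, the second echoes the last message.
const flakyProvider: ChatProvider = {
  name: "primary",
  complete: async () => {
    throw new Error("rate limited");
  },
};
const backupProvider: ChatProvider = {
  name: "backup",
  complete: async (messages) => {
    const last = messages[messages.length - 1];
    return `echo: ${last ? last.content : ""}`;
  },
};

routeChat([flakyProvider, backupProvider], [{ role: "user", content: "hi" }])
  .then((result) => console.log(result.provider, result.text));
```

Because application code only sees `routeChat`, swapping or reordering providers becomes a configuration change rather than a refactor.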


6. The use-case decision framework

Matching workload patterns to infrastructure

Just call an API (Bedrock, OpenAI, Anthropic direct, Together AI): This is the right starting point for 90% of startup LLM use cases. Zero ops overhead. Focus on product-market fit, not infrastructure. Use when: chat, summarization, code generation, classification, embeddings. Cost: $50–500/month for moderate usage.

Serverless CPU inference (Lambda + ONNX, SageMaker Serverless, Workers AI): For sporadic traffic with acceptable cold starts. Lambda + ONNX can run scikit-learn/XGBoost models for $3–25/month. SageMaker Serverless handles larger models up to 6 GB for ~$40/month at 10M invocations. Workers AI gives you GPU inference on pre-deployed models with a 10K Neurons/day free tier. Best when: traffic is spiky, cost is paramount, latency tolerance is 1–5 seconds.

Batch processing (AWS Batch + Spot GPU, SageMaker Batch Transform): For training runs, bulk inference on datasets, feature engineering pipelines, and hyperparameter sweeps. Pay only during execution, zero idle cost. Batch + Spot GPU saves 60–90% vs on-demand. Best when: latency tolerance is hours, workloads are fault-tolerant with checkpointing.

Always-on inference (ECS on EC2 with GPU, SageMaker Real-Time Endpoints): For consistent, high-volume traffic requiring sub-100ms latency. SageMaker Real-Time now supports scale-to-zero (since Nov 2024), narrowing the gap with serverless. An ml.g5.xlarge endpoint costs ~$1,030/month always-on but handles thousands of requests per second. Best when: traffic exceeds ~800K requests/month, latency SLA below 100ms.

Managed MLOps (SageMaker end-to-end): When you have 3+ ML engineers, multiple models in production, and need experiment tracking, model registry, pipeline orchestration, and monitoring as an integrated system. Before that scale, Docker + ECS + CloudWatch + GitHub Actions is a perfectly valid ML platform.

Traffic-level thresholds

| Monthly requests | Recommended approach | Approximate cost |
| --- | --- | --- |
| < 1,000 | Lambda or API calls | < $10/month |
| 1K–100K | SageMaker Serverless or Lambda + ONNX | $10–100/month |
| 100K–1M | ECS/Fargate or SageMaker Serverless | $50–500/month |
| 1M–10M | ECS with auto-scaling or SageMaker Real-Time | $200–2,000/month |
| 10M+ | SageMaker Real-Time with auto-scaling or self-hosted | $1,000+/month |
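
The thresholds above translate directly into a lookup, sketched below. The type and function names are hypothetical, and the treatment of exact boundary values (a value on a boundary goes to the higher tier) is an assumption, since the table does not specify it.

```typescript
interface TrafficTier {
  maxMonthlyRequests: number; // exclusive upper bound
  approach: string;
  approxCost: string;
}

// Tiers copied from the table above, ordered from lowest to highest traffic.
const LOWER_TIERS: TrafficTier[] = [
  { maxMonthlyRequests: 1_000, approach: "Lambda or API calls", approxCost: "< $10/month" },
  { maxMonthlyRequests: 100_000, approach: "SageMaker Serverless or Lambda + ONNX", approxCost: "$10–100/month" },
  { maxMonthlyRequests: 1_000_000, approach: "ECS/Fargate or SageMaker Serverless", approxCost: "$50–500/month" },
  { maxMonthlyRequests: 10_000_000, approach: "ECS with auto-scaling or SageMaker Real-Time", approxCost: "$200–2,000/month" },
];
const TOP_TIER: TrafficTier = {
  maxMonthlyRequests: Infinity,
  approach: "SageMaker Real-Time with auto-scaling or self-hosted",
  approxCost: "$1,000+/month",
};

// Returns the first tier whose upper bound exceeds the given volume.
// Assumption: exact boundary values fall into the higher tier.
function recommendForTraffic(monthlyRequests: number): TrafficTier {
  for (const tier of LOWER_TIERS) {
    if (monthlyRequests < tier.maxMonthlyRequests) return tier;
  }
  return TOP_TIER;
}

console.log(recommendForTraffic(250_000).approach);
```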

7. The foundation model landscape and when to self-host

Current pricing for frontier models

The foundation model market has compressed dramatically. Here are the models that matter most for startups as of early 2026:

| Provider | Model | Input/MTok | Output/MTok | Best for |
| --- | --- | --- | --- | --- |
| Anthropic | Claude Opus 4.6 | $5.00 | $25.00 | Complex reasoning, agents, coding |
| Anthropic | Claude Sonnet 4.5 | $3.00 | $15.00 | Balanced quality/cost |
| Anthropic | Claude Haiku 4.5 | $1.00 | $5.00 | Fast, cost-optimized |
| OpenAI | GPT-5.2 | $1.75 | $14.00 | General purpose flagship |
| OpenAI | GPT-5 Mini | $0.25 | $2.00 | Budget, high-volume |
| OpenAI | GPT-5 Nano | $0.05 | $0.40 | Cheapest GPT |
| Google | Gemini 2.5 Pro | $1.25 | $10.00 | Long context (1M), reasoning |
| Google | Gemini 2.0 Flash | $0.15 | $0.60 | Ultra-cheap, high-volume |
| Meta (via Bedrock) | Llama 3.3 70B | $0.72 | $0.72 | Open-source, fine-tunable |
| AWS | Amazon Nova 2 Pro | Varies | Varies | Native AWS, 1M context |

Both Anthropic and OpenAI offer prompt caching (up to 90% savings on cached input tokens) and batch processing (50% discount). Depending on how much of a workload is cacheable or can tolerate batch latency, these features alone can cut LLM costs by 50–90%.

Bedrock vs direct API: when each wins

Use Bedrock when you want unified billing, VPC-level security, IAM integration, built-in RAG (Knowledge Bases), Agents, Guardrails, and the ability to switch between models through a unified API without infrastructure changes. Data never leaves AWS, and batch inference is available at a 50% discount via the Flex tier.

Use the direct Anthropic/OpenAI API when you need the absolute latest model versions (Bedrock lags slightly), provider-specific features (Anthropic's extended thinking, OpenAI's realtime API), or simpler integration for a single provider.

The self-hosting decision

Self-hosting LLMs (via vLLM, Ollama, or SGLang) makes financial sense only at high volume. The breakeven is roughly 30M+ tokens/day - below that, API costs are lower because you avoid paying for idle GPU capacity. A single A100-80GB on AWS costs ~$2,000+/month regardless of utilization.
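
The breakeven can be approximated with simple arithmetic, sketched below. The $2.00 blended API price per million tokens is a hypothetical input chosen for illustration, and the calculation ignores engineering time, redundancy, and whether a single GPU can actually serve that volume.

```typescript
// Daily token volume at which a fixed monthly GPU bill equals the API bill.
// Assumptions: one blended API price for input and output tokens, 30-day month,
// and no allowance for ops overhead or GPU throughput limits.
function breakevenTokensPerDay(
  gpuMonthlyUsd: number,
  apiUsdPerMillionTokens: number,
  daysPerMonth = 30,
): number {
  const tokensPerMonth = (gpuMonthlyUsd / apiUsdPerMillionTokens) * 1_000_000;
  return tokensPerMonth / daysPerMonth;
}

// ~$2,000/month GPU (figure from the text) vs a hypothetical $2.00 per million tokens.
const breakeven = breakevenTokensPerDay(2000, 2.0);
console.log(`${(breakeven / 1_000_000).toFixed(1)}M tokens/day`);
```

With cheaper API models the breakeven volume rises proportionally, which is why falling per-token prices keep pushing the self-hosting threshold upward.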

vLLM is the production standard for self-hosted LLM serving: 793 tokens/sec peak throughput, PagedAttention for 40%+ memory savings, OpenAI-compatible API, distributed inference across GPUs. Ollama is best for local development only (caps at ~4 parallel requests). TGI entered maintenance mode in December 2025 - new projects should use vLLM or SGLang.

For most startups: just call an API. Self-host only when you hit $10K+/month in API spend with >70% GPU utilization potential, have strict compliance requirements (air-gapped environments), or need custom model modifications that API providers don't support.


8. What you're not asking but should be

Start with pgvector - you already have it

Your Aurora Serverless v2 supports the pgvector extension, and SST's sst.aws.Vector component wraps it with a clean SDK. For RAG applications with under 10 million vectors, pgvector matches or beats dedicated vector databases on performance. With pgvectorscale, benchmarks show 471 QPS at 99% recall on 50M vectors with p95 latency of 28ms. HNSW indexing and iterative scan (pgvector 0.8.0) delivered a 5.7x query performance improvement. You get ACID transactions, combined relational + vector queries, and zero additional infrastructure cost. Only evaluate Pinecone or Qdrant Cloud if you outgrow 10M vectors or need auto-scaling for extremely spiky traffic.

ML observability: the minimum viable stack

Start with MLflow (self-hosted on Fargate) + CloudWatch. MLflow is free, open-source, and provides experiment tracking, model packaging, and a model registry. Host the tracking server on a small Fargate task backed by your Aurora database and S3 for artifacts. CloudWatch handles infrastructure metrics for SageMaker endpoints and ECS services. Add Arize Phoenix (free, open-source) or Evidently AI (free, open-source) for production model monitoring when you have models deployed. Only consider W&B Pro ($50/user/month) or Arize AX Pro ($50/month) when you have 5+ models in production.

Model versioning without overengineering

Use DVC (Data Version Control) for Git-native data and model versioning - it stores lightweight pointer files in Git while actual artifacts live in S3. Combined with MLflow Model Registry's staging workflow (None → Staging → Production → Archived), this gives you reproducible experiments and controlled promotion without SageMaker-level complexity. For A/B testing, SageMaker supports Production Variants with traffic splitting and shadow deployments (new model receives real traffic but responses aren't returned to users). Canary deployments with automatic rollback via CloudWatch alarms are also built in.

Cost traps and optimization strategies

  • Spot instances for training save 60–90% - Spotify cut ML costs from $8.2M to $2.4M. Checkpoint every 15 minutes, use 10–15 diverse instance types, and set SPOT_CAPACITY_OPTIMIZED allocation strategy.
  • SageMaker Savings Plans save up to 64% for 1–3 year commitments across all SageMaker workloads. Only commit after 30–90 days of stable usage baseline.
  • Graviton instances (ARM-based) deliver up to 40% better price-performance for CPU-based inference (scikit-learn, XGBoost).
  • Set AWS Budgets alerts immediately. ML costs spike 10x overnight when a training job runs too long or an endpoint auto-scales unexpectedly.
  • Track LLM token usage from day 1. A Bedrock call that costs $0.01 per request becomes $10,000/month at 1M requests per month.
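
To see how per-request token counts turn into a monthly bill, here is a small sketch. The request profile (1,000 input and 500 output tokens) is hypothetical; the prices are the Claude Sonnet 4.5 figures quoted earlier in this guide.

```typescript
interface RequestProfile {
  inputTokens: number;
  outputTokens: number;
}

interface ModelPrice {
  inputUsdPerMTok: number;
  outputUsdPerMTok: number;
}

// Cost of a single request given token counts and per-million-token prices.
function costPerRequestUsd(profile: RequestProfile, price: ModelPrice): number {
  return (
    (profile.inputTokens * price.inputUsdPerMTok +
      profile.outputTokens * price.outputUsdPerMTok) /
    1_000_000
  );
}

// Projected monthly spend at a given request volume.
function monthlyLlmCostUsd(requestsPerMonth: number, profile: RequestProfile, price: ModelPrice): number {
  return requestsPerMonth * costPerRequestUsd(profile, price);
}

const sonnetPrice: ModelPrice = { inputUsdPerMTok: 3.0, outputUsdPerMTok: 15.0 };
const typicalRequest: RequestProfile = { inputTokens: 1_000, outputTokens: 500 }; // hypothetical profile

console.log(costPerRequestUsd(typicalRequest, sonnetPrice));
console.log(monthlyLlmCostUsd(1_000_000, typicalRequest, sonnetPrice));
```

Logging these two token counts per request from the start makes it possible to rerun this projection whenever prompts or models change.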

The minimal CI/CD pipeline for ML

For a startup, this is enough:

  1. Git push triggers GitHub Actions
  2. Build: lint, unit test, build Docker image, push to ECR
  3. Train (triggered on data/model changes): submit SageMaker Training Job or Batch job
  4. Evaluate: compare metrics against baseline; gate on accuracy threshold
  5. Register: log model to MLflow or S3 with version tag
  6. Deploy to staging: update ECS service or SageMaker endpoint
  7. Smoke test: hit endpoint with test data, verify response
  8. Promote to production: manual approval or automated

One team reported running this entire pipeline for < $820/month. The full cycle from push to staged deployment takes ~17–22 minutes.
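
The evaluate step (step 4) is the part most worth automating early. Below is a minimal sketch of such a gate, assuming accuracy is the only metric being compared; the names, the 0.005 tolerance, and the sample numbers are hypothetical.

```typescript
interface EvalMetrics {
  accuracy: number; // fraction between 0 and 1
}

interface GateDecision {
  promote: boolean;
  reason: string;
}

// Blocks promotion if the candidate misses the absolute floor
// or regresses against the baseline by more than the tolerance.
function evaluationGate(
  candidate: EvalMetrics,
  baseline: EvalMetrics,
  minAccuracy: number,
  tolerance = 0.005, // hypothetical allowance for run-to-run noise
): GateDecision {
  if (candidate.accuracy < minAccuracy) {
    return { promote: false, reason: `accuracy ${candidate.accuracy} is below the floor ${minAccuracy}` };
  }
  if (candidate.accuracy < baseline.accuracy - tolerance) {
    return { promote: false, reason: `accuracy ${candidate.accuracy} regresses from baseline ${baseline.accuracy}` };
  }
  return { promote: true, reason: "meets the accuracy floor and the baseline" };
}

console.log(evaluationGate({ accuracy: 0.91 }, { accuracy: 0.9 }, 0.85));
console.log(evaluationGate({ accuracy: 0.88 }, { accuracy: 0.9 }, 0.85));
```

In a GitHub Actions job, a failed gate would typically exit non-zero so the register and deploy steps never run.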

Security non-negotiables

Deploy SageMaker and Bedrock resources in VPC subnets with network isolation. Use VPC Interface Endpoints (PrivateLink) for SageMaker Runtime and Bedrock to keep ML traffic off the public internet. Scope IAM policies with least-privilege - bedrock:InvokeModel restricted to specific model ARNs, separate execution roles for training vs inference. Enable KMS encryption on all S3 model artifact buckets. If targeting healthcare or finance, plan security architecture from day 1 - retrofitting HIPAA/SOC2 compliance is expensive and painful.
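
As an example of the least-privilege scoping described above, the following sketch builds an IAM policy document that allows `bedrock:InvokeModel` on a single foundation model ARN. The region and model ID are illustrative choices; substitute the models your application actually calls.

```typescript
// IAM policy document limited to invoking one specific Bedrock foundation model.
const invokeSingleModelPolicy = {
  Version: "2012-10-17",
  Statement: [
    {
      Sid: "InvokeSingleModel",
      Effect: "Allow",
      Action: ["bedrock:InvokeModel"],
      // Illustrative ARN: foundation-model ARNs have an empty account ID segment.
      Resource: [
        "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-haiku-20240307-v1:0",
      ],
    },
  ],
};

// IAM policy resources accept the document as a JSON string.
const invokePolicyJson = JSON.stringify(invokeSingleModelPolicy, null, 2);
console.log(invokePolicyJson);
```

Attaching this to the inference role only, and a separate policy to the training role, keeps a compromised endpoint from reaching anything beyond the one model it needs.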


Conclusion

The ML deployment landscape for SST v4 builders breaks into a clear decision hierarchy. For LLM features, start by calling an API - Bedrock for AWS-native integration, or Anthropic/OpenAI/Together AI directly for latest models and lowest latency. Your existing Aurora Serverless v2 with pgvector is a production-ready vector store for RAG. For custom model inference, Workers AI and Lambda handle lightweight workloads at near-zero cost. When you need GPU, don't fight AWS's Fargate/Lambda limitations - reach for Modal or GCP Cloud Run GPU for serverless GPU containers, or AWS Batch with Spot instances for training jobs at 60–90% savings.

The insight most builders miss: the operational complexity gap between "call an API" and "manage GPU infrastructure" is enormous. A solo developer calling Claude via Bedrock can ship an AI feature in a day. That same developer provisioning SageMaker endpoints with auto-scaling, model monitoring, and CI/CD needs weeks. Exhaust the API-first approach before reaching for infrastructure. When you do need infrastructure, Pulumi inside your sst.config.ts can provision AWS Batch, SageMaker, and Bedrock resources without leaving your deployment stack - and external platforms like Modal and Groq fill the gaps where AWS falls short. The goal isn't to build an MLOps platform; it's to ship ML features that your users value.