Builders can scale ML from simple API calls to full MLOps pipelines using SST on AWS, with Aurora pgvector for search and Spot instances for up to 90 percent cost savings. External platforms like Modal or GCP Cloud Run provide superior serverless GPU options for real-time inference when AWS native limits are reached.
SST uses Pulumi to bridge high-level web components (API, Database) with low-level AWS resources (SageMaker, GPU clusters). The framework enables infrastructure-as-code in TypeScript, allowing developers to manage entire ML lifecycles within a single configuration.
Your SST v4 stack already gives you more ML infrastructure than you think - and where it falls short, the answer is rarely "build more on AWS." This guide maps every ML deployment option from your existing sst.config.ts outward to external GPU platforms, organized so you can make the cheapest, fastest decision for each use case. The key insight: most startups should call an API for LLM tasks, use Cloudflare Workers AI or Lambda for lightweight inference, and only reach for GPU infrastructure when they've proven the need. AWS has the broadest service catalog but critical gaps - no GPU on Lambda, no GPU on Fargate - that make external platforms essential for serverless GPU workloads.
SST v4 doesn't have dedicated ML components, but its existing primitives cover more ground than you'd expect. The key is knowing what fits where.
sst.cloudflare.Worker

Despite some buzz, SST has no first-class WorkersAI component. What exists is the sst.cloudflare.Worker component, which deploys a standard Cloudflare Worker - and inside that Worker, you access Workers AI through the env.AI binding. This is Cloudflare's native pattern, not an SST abstraction:
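```typescript
// sst.config.ts
// Deploys a standard Cloudflare Worker with a public URL.
const worker = new sst.cloudflare.Worker("AIWorker", {
  handler: "./ai-worker.ts",
  url: true,
});
```

```typescript
// ai-worker.ts
// `Ai` is the Workers AI binding type from @cloudflare/workers-types.
interface Env {
  AI: Ai;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    // Run a hosted model through the Workers AI binding.
    const response = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
      messages: [{ role: "user", content: "Summarize this document" }],
    });
    return Response.json(response);
  },
};
```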
Workers AI runs 50+ models on Cloudflare's edge network across 300+ locations. The model catalog includes Llama 4 Scout 17B, Llama 3.3 70B, DeepSeek R1 distilled variants, Mistral Small 3.1 24B, Qwen 2.5 Coder 32B, OpenAI GPT-OSS (120B and 20B), FLUX.2 for image generation, Whisper for speech-to-text, and multiple embedding models (BGE, EmbeddingGemma). Pricing is $0.011 per 1,000 Neurons (a unit measuring GPU compute per request), with a generous free tier of 10,000 Neurons/day. The Workers Paid plan costs $5/month for usage beyond free allocation. For a startup adding AI features to an existing app, Workers AI is often the fastest path to production - you get serverless GPU inference with no cold start management, edge-distributed latency, and an OpenAI-compatible API format.
Limitations matter, though. You cannot upload custom models (only Cloudflare's curated catalog), there's no fine-tuning, and large models (70B+) aren't practical on edge GPUs. Workers AI is for calling pre-deployed open-source models, not for custom ML workloads.
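To make the Neuron pricing above concrete, here is a minimal cost sketch. It uses only the rates quoted in this guide; the function name is illustrative, and adding the $5 plan fee once usage leaves the free tier is an assumption about how the bill is composed, so check Cloudflare's current pricing before relying on it.

```typescript
// Rates as quoted in this guide; verify against Cloudflare's pricing page.
const PRICE_PER_1K_NEURONS_USD = 0.011;
const FREE_NEURONS_PER_DAY = 10_000;
const WORKERS_PAID_PLAN_USD = 5;

// Hypothetical helper: estimated monthly Workers AI bill for a steady daily load.
function workersAiMonthlyCost(neuronsPerDay: number, days: number = 30): number {
  const billablePerDay = Math.max(0, neuronsPerDay - FREE_NEURONS_PER_DAY);
  if (billablePerDay === 0) {
    return 0; // usage stays inside the free daily allocation
  }
  const usageCost = (billablePerDay * days * PRICE_PER_1K_NEURONS_USD) / 1000;
  // Assumption: the $5/month paid plan applies once the free tier is exceeded.
  return WORKERS_PAID_PLAN_USD + usageCost;
}
```

Under these assumptions, 50,000 Neurons/day works out to about $18/month, while anything at or below 10,000 Neurons/day is free.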
sst.aws.Task for CPU-bound ML work

The Task component spins up Fargate containers for async, long-running jobs. It supports up to 16 vCPU and 120 GB memory, EFS mounts for shared model files, and arm64 architecture for cost savings. At minimum configuration (0.25 vCPU + 0.5 GB), it costs roughly $0.02/hour, and unlike Lambda it has no 15-minute execution cap, which makes it the natural home for longer-running jobs.
ML workloads that fit here: data preprocessing/ETL pipelines, scikit-learn or XGBoost inference, feature engineering, batch scoring with CPU-based models, model artifact packaging. In sst dev, tasks run locally via a configurable dev.command, making the development loop tight.
The hard limit: no GPU. Fargate does not support GPU instances, full stop. This constraint shapes everything downstream.
| Component | What it does for ML |
|---|---|
| sst.aws.Vector | pgvector-backed vector database on Aurora Serverless v2. Up to 2,000 dimensions. Built-in SDK with put, query, remove + metadata filtering. Your RAG starting point. |
| sst.aws.StepFunctions (beta) | Orchestrate multi-step ML pipelines across Lambda, ECS Task, Batch, CodeBuild. JSONata transforms. |
| sst.aws.Cluster + Service | ECS Fargate services for always-on inference containers. Supports Fargate Spot for dev environments. Still no GPU. |
| sst.aws.Function | Lambda functions (Python, Node, Rust, Go). 15-min timeout, 10 GB memory. Lightweight CPU inference only. |
| sst.aws.Efs | Elastic File System - share model files across Tasks and Services. |
| sst.aws.Bucket | S3 for model artifacts, training data, batch results. |
| sst.aws.Queue | SQS for async inference request queuing. |
What SST does NOT have native components for: SageMaker, Bedrock, AWS Batch, or any GPU compute. For those, you drop down to Pulumi.
sst.config.ts

Since SST v4 runs on Pulumi (AWS provider v7.21.0), every @pulumi/aws resource is available directly in your config file. This means you can provision GPU infrastructure, SageMaker endpoints, Bedrock agents, and complex ML pipelines without leaving your SST project.
AWS Batch is a managed job scheduler with zero surcharge - you pay only for the underlying EC2 instances. This makes it the cheapest path to GPU compute on AWS for training and batch inference.
```typescript
import * as aws from "@pulumi/aws";

const gpuEnv = new aws.batch.ComputeEnvironment("mlGpu", {
  type: "MANAGED",
  computeResources: {
    type: "SPOT", // 60-90% savings vs on-demand
    instanceTypes: ["g5.2xlarge"],
    maxVcpus: 16,
    minVcpus: 0, // scale to zero when idle
    allocationStrategy: "SPOT_CAPACITY_OPTIMIZED",
    ec2Configurations: [{ imageType: "ECS_AL2_NVIDIA" }],
    // ... security groups, subnets, IAM
  },
});
```
Spot instances save 60–90% off on-demand GPU pricing. A g5.xlarge (A10G GPU) runs roughly $1.00/hr on-demand but $0.30–0.40/hr on Spot. For fault-tolerant training with checkpointing every 15 minutes, Spot is a no-brainer. Supported GPU instance families include g4dn (T4, ~$0.53/hr), g5 (A10G, ~$1.01/hr), g6 (latest gen), p3 (V100, ~$3.06/hr), p4d (A100, ~$21.96/hr), and p5 (H100, ~$98.32/hr). Set minVcpus: 0 and instances terminate when no jobs are running - true zero idle cost.
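The Spot arithmetic above can be sketched as a small estimator. The function name and the 10% interruption overhead (work replayed since the last checkpoint) are illustrative assumptions, not AWS figures; the hourly rates are the ones quoted in this guide.

```typescript
interface SpotEstimate {
  onDemandCost: number;
  spotCost: number;
  savingsPercent: number;
}

// Hypothetical helper: compare on-demand vs Spot cost for one training run.
function estimateSpotTraining(
  gpuHours: number,
  onDemandRate: number,
  spotRate: number,
  interruptionOverhead: number = 0.1, // assumption: 10% of work is redone after interruptions
): SpotEstimate {
  const onDemandCost = gpuHours * onDemandRate;
  // Interruptions replay work since the last checkpoint, so pad the Spot hours.
  const spotCost = gpuHours * (1 + interruptionOverhead) * spotRate;
  const savingsPercent = (1 - spotCost / onDemandCost) * 100;
  return { onDemandCost, spotCost, savingsPercent };
}
```

For a 100-hour run on a g5.xlarge at $1.00/hr on-demand versus $0.35/hr Spot, this yields roughly $38.50 instead of $100, a saving of about 61% even after the assumed overhead. Shorter checkpoint intervals shrink the overhead term.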
Pulumi's aws.sagemaker.* namespace covers the full SageMaker lifecycle: Model, EndpointConfiguration, Endpoint, NotebookInstance, Domain, FeatureGroup, Pipeline, and MonitoringSchedule. A real-time inference endpoint provisioned in your SST config looks like:
```typescript
// `sagemakerRole` is assumed to be an IAM role defined elsewhere in the config
// that SageMaker is allowed to assume.
const model = new aws.sagemaker.Model("myModel", {
  executionRoleArn: sagemakerRole.arn,
  primaryContainer: {
    image: "763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-inference:2.0-gpu",
    modelDataUrl: "s3://my-bucket/model.tar.gz",
  },
});

// Instance type and count for the endpoint's single production variant.
const config = new aws.sagemaker.EndpointConfiguration("myConfig", {
  productionVariants: [{
    instanceType: "ml.g4dn.xlarge",
    initialInstanceCount: 1,
    modelName: model.name,
    variantName: "AllTraffic",
  }],
});

// The real-time HTTPS endpoint that serves the model.
const endpoint = new aws.sagemaker.Endpoint("myEndpoint", {
  endpointConfigName: config.name,
});
```
The aws.bedrock.* namespace includes 25+ resources for provisioning Bedrock infrastructure as code: AgentAgent, AgentKnowledgeBase, AgentDataSource, CustomModel, Guardrail, ProvisionedModelThroughput, and more. You can wire up a complete RAG pipeline - S3 data source → embedding → Aurora pgvector store → Bedrock Knowledge Base → Agent - entirely within sst.config.ts. Community examples demonstrate building full Bedrock Knowledge Bases with SST using raw Pulumi resources.
Other Pulumi ML resources worth knowing: aws.ecs.TaskDefinition with GPU resource requirements (EC2 launch type only), aws.sfn.StateMachine for Step Functions ML orchestration (supports native SageMaker CreateTrainingJob, Batch SubmitJob, Bedrock InvokeModel actions), and aws.bedrockfoundation.getModels to query available foundation models dynamically.
SageMaker is AWS's kitchen-sink ML platform. Here's the honest startup-relevance breakdown:
Use immediately:
Use when you have production models:
Use when scale demands it:
Skip unless specific need:
| Mode | How it works | Cold start | Max memory | GPU? | Best for |
|---|---|---|---|---|---|
| Real-time | Always-on instances. Since Nov 2024, can scale to zero. | None (warm) | Unlimited | Yes | Consistent traffic, sub-second latency |
| Serverless | Auto-provisions on request, scales to zero | 1–3 seconds | 6 GB | No | Intermittent traffic, cost-sensitive |
| Async | Queue-based (S3 in/out), scales to zero | Variable | Unlimited | Yes | Large payloads, long inference |
| Batch Transform | Process entire dataset, then shut down | N/A | Unlimited | Yes | Offline scoring, scheduled predictions |
SageMaker Serverless Inference deserves special attention for startups. It charges ~$0.00008/second at 4 GB memory plus $0.016/GB data processed. At 10M invocations × 100ms × 2 GB, that's roughly $40/month. The crossover point where an always-on ml.m5.xlarge (~$196/month) becomes cheaper is around 800,000 requests/month at 500ms per inference. Key limitations: CPU-only, 6 GB max memory, 4 MB max payload, ~60-second timeout, no VPC support, no Model Monitor. Models above ~3–4 GB compressed won't fit. Concurrency caps at 200 per region across all serverless endpoints.
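The cost figures above can be reproduced with a short calculator. This is a sketch that assumes the per-second rate scales linearly with configured memory (so $0.00008/second at 4 GB becomes $0.00002 per GB-second) and that ignores the data-processing charge; both function names are illustrative.

```typescript
// Derived from the 4 GB rate quoted above; linear scaling with memory is an assumption.
const RATE_PER_GB_SECOND_USD = 0.00008 / 4;

// Compute-only monthly cost of a serverless endpoint.
function serverlessComputeCost(
  invocations: number,
  secondsPerCall: number,
  memoryGb: number,
): number {
  return invocations * secondsPerCall * memoryGb * RATE_PER_GB_SECOND_USD;
}

// Invocations per month at which an always-on instance costs the same.
function breakEvenInvocations(
  alwaysOnMonthlyUsd: number,
  secondsPerCall: number,
  memoryGb: number,
): number {
  return alwaysOnMonthlyUsd / (secondsPerCall * memoryGb * RATE_PER_GB_SECOND_USD);
}
```

`serverlessComputeCost(10_000_000, 0.1, 2)` reproduces the ~$40/month figure. Note that on compute charges alone, `breakEvenInvocations(196, 0.5, 4)` lands in the millions of requests, so the 800,000 crossover quoted above presumably rests on additional assumptions not stated here; run the numbers for your own latency and memory size.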
AWS Lambda does not support GPUs, and nothing suggests that will change. The constraint is architectural: Firecracker (Lambda's micro-VM technology) was explicitly designed without hardware accelerator passthrough. No re:Invent announcement has changed this.
What Lambda can do for ML inference: run ONNX Runtime models (17% faster than PyTorch for single inference), scikit-learn, LightGBM, XGBoost, and small quantized models (e.g., 4-bit GGUF models under 10 GB). Current limits are 10 GB memory, 10 GB container image, 10 GB /tmp storage, 15-minute timeout. One practitioner reported cutting inference costs from $36–171/month (SageMaker) to $3–25/month with Lambda + ONNX for a side project. Lambda SnapStart (available for Python) reduces cold starts from ~16 seconds to ~1.6 seconds for ML model loading. Best memory allocation for ML is typically 2,048 MB - doubling memory halves execution time until a plateau around 4,096 MB.
AWS Fargate does not support GPU as of early 2026. The GitHub roadmap issue (#88) has been open for years, listed as "Work in Progress." AWS's own FAQ explicitly states: "Use Amazon EC2 for GPU workloads, which are not supported on AWS Fargate today." Some speculative sources mention enhanced GPU support coming in 2026, but no official GA announcement exists. Treat as unavailable for production planning.
The workaround is ECS on EC2 with GPU instances: use the ECS GPU-optimized AMI, set ECS_ENABLE_GPU_SUPPORT=true, and specify GPU resource requirements in your task definition. This requires managing EC2 capacity providers instead of Fargate's managed scaling - a meaningful operational burden for a solo developer. Instance options range from g4dn.xlarge at ~$0.53/hr (T4, good for inference) to p5.48xlarge at ~$98.32/hr (H100, frontier training only).
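As a rough illustration of the "GPU resource requirements" step, the sketch below builds the container-definitions JSON you would hand to an ECS task definition (for example the `containerDefinitions` string of `aws.ecs.TaskDefinition`). The helper name, container name, and memory value are placeholders chosen for this example.

```typescript
interface GpuContainerDefinition {
  name: string;
  image: string;
  memory: number; // hard memory limit in MiB
  essential: boolean;
  resourceRequirements: { type: "GPU"; value: string }[];
}

// Hypothetical helper: serialize one GPU-enabled container for an EC2-backed ECS task.
function gpuContainerDefinitions(image: string, gpuCount: number): string {
  const definition: GpuContainerDefinition = {
    name: "inference", // placeholder container name
    image,
    memory: 8192, // placeholder; size to your model
    essential: true,
    // ECS expects the GPU count as a string value.
    resourceRequirements: [{ type: "GPU", value: String(gpuCount) }],
  };
  return JSON.stringify([definition]);
}
```

The task must run with the EC2 launch type on a GPU instance; the same definition is rejected on Fargate.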
| Factor | AWS Batch | SageMaker Training |
|---|---|---|
| Surcharge | None - raw EC2 pricing | ~15–20% premium |
| Container control | Full Docker, any image | Must conform to SageMaker container contract |
| ML features | Basic scheduling | Distributed training, HPO, debugging, profiling |
| Integration | Manual S3/CloudWatch | Native Experiments, Model Registry, Pipelines |
| Best for | Cost optimization, custom containers | Managed ML experience, integrated MLOps |
Amazon's own Search team uses AWS Batch for SageMaker Training jobs, increasing GPU utilization from 40% to 80%. For a startup doing occasional training runs, Batch + Spot is likely the cheapest path.
AWS Bedrock provides ~100 serverless models including Claude Opus 4.5/Sonnet 4.5/Haiku 4.5, Llama 4, DeepSeek V3.2, Mistral Large 3, Amazon Nova 2, GPT-OSS-120B, Gemma 3, and models from Cohere, AI21, MiniMax, and Moonshot AI. Key per-million-token pricing:
| Model | Input | Output |
|---|---|---|
| Claude Sonnet 4.5 | ~$3.00 | ~$15.00 |
| Claude Haiku 4.5 | ~$1.00 | ~$5.00 |
| Llama 3.3 70B | ~$0.72 | ~$0.72 |
| DeepSeek V3.2 | $0.62 | $1.85 |
| GPT-OSS-120B | $0.15 | $0.62 |
| Mistral Large 3 | $0.50 | $1.50 |
| NVIDIA Nemotron Nano 2 | $0.06 | $0.23 |
Bedrock's Flex tier offers a 50% discount for lower-priority processing. Key features beyond raw inference: Knowledge Bases (managed RAG with S3 → auto-chunking → embedding → vector store, supporting Aurora PostgreSQL, OpenSearch, Pinecone), Agents (autonomous tool-using agents with MCP support), Guardrails (content filtering, PII masking, hallucination detection with automated reasoning), and fine-tuning for select models (Llama, Titan, GPT-OSS).
AWS's custom chips offer up to 70% lower cost per inference (Inferentia2) and up to 50% lower training costs (Trainium) vs comparable GPUs. The latest Trainium3 (3nm, announced Dec 2025) delivers 2.52 PFLOPS FP8 per chip. The ecosystem requires the Neuron SDK and primarily supports PyTorch with transformer architectures.
For startups: skip these until inference costs exceed ~$10K/month. The Neuron SDK learning curve and limited model architecture support add friction that isn't worth it early on. Stick with GPUs for maximum flexibility during experimentation.
Step Functions provides native integrations with SageMaker (CreateTrainingJob, CreateTransformJob, etc.), Batch (SubmitJob), Lambda, and ECS - all with .sync wait-for-completion support. Pricing is $0.025 per 1,000 state transitions for Standard workflows. Use Step Functions when you need to mix ML and non-ML services in a pipeline. Use SageMaker Pipelines when you're fully in the SageMaker ecosystem and want built-in lineage tracking and approval workflows.
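A minimal sketch of such a pipeline in Amazon States Language, written as a TypeScript object: a Lambda preprocessing step followed by a Batch training job that uses the `.sync` pattern to wait for completion. The function, queue, and job-definition names are hypothetical placeholders, and the cost helper simply applies the Standard-workflow rate quoted above.

```typescript
// Hypothetical two-step ML pipeline: preprocess on Lambda, then train on Batch.
const trainingPipeline = {
  Comment: "Preprocess data, then run a GPU training job and wait for it",
  StartAt: "Preprocess",
  States: {
    Preprocess: {
      Type: "Task",
      Resource: "arn:aws:states:::lambda:invoke",
      Parameters: { FunctionName: "preprocess-fn", "Payload.$": "$" }, // placeholder name
      ResultPath: "$.preprocess",
      Next: "Train",
    },
    Train: {
      Type: "Task",
      // `.sync` makes the state wait until the Batch job finishes.
      Resource: "arn:aws:states:::batch:submitJob.sync",
      Parameters: {
        JobName: "train-model",
        JobQueue: "ml-gpu-queue", // placeholder queue
        JobDefinition: "train-job-def", // placeholder job definition
      },
      End: true,
    },
  },
};

// JSON string suitable for a state machine's `definition` property.
const pipelineDefinition: string = JSON.stringify(trainingPipeline);

// Standard workflow pricing at $0.025 per 1,000 state transitions.
function standardWorkflowCost(stateTransitions: number): number {
  return (stateTransitions / 1000) * 0.025;
}
```

A daily run of this two-state pipeline costs a fraction of a cent in transitions; the bill is dominated by the Batch instances it launches.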
This is where the landscape gets interesting for a startup. Several external platforms solve problems that AWS cannot - particularly serverless GPU with scale-to-zero.
Modal is the developer experience winner. A Python-first serverless GPU platform where you decorate functions with GPU requirements and Modal handles everything: containerization, provisioning, auto-scaling, scale-to-zero. Pricing is per-second - an H100 costs ~$3.95/hr, an A100-80GB ~$2.50/hr, a T4 ~$0.59/hr. Cold starts are typically 2–4 seconds thanks to a custom container runtime that's 100x faster than Docker. Modal offers $30/month free credits on the Starter plan and up to $25K in startup credits. Best for: Python-native ML teams who want zero infrastructure management, variable workloads, fine-tuning jobs, and inference APIs.
RunPod offers the cheapest raw GPU pricing. Their serverless offering has two modes: Flex Workers (scale-to-zero, pay only when processing) and Active Workers (always-on, 20–30% discount). An A100-80GB runs ~$1.64/hr on Community Cloud, ~$2.17/hr on Secure Cloud. FlashBoot achieves sub-200ms cold starts for 48% of requests. No egress fees. Best for: cost-sensitive GPU workloads at scale, image generation, custom model serving.
Replicate provides the broadest model library with the lowest setup friction. Thousands of community models accessible via a simple API, plus 100+ "official" models that are always warm with no cold starts. Pricing is per-prediction for official models or per-second of GPU time for custom models (T4 ~$0.81/hr, A100-80GB ~$5.04/hr). Replicate was acquired by Cloudflare in 2025. Best for: rapid prototyping, running popular open-source models without infrastructure.
Groq is the speed king for LLM inference. Their custom Language Processing Unit (LPU) - an all-SRAM, deterministic-execution chip - delivers 300–1,300+ tokens/second, which is 5–15x faster than GPU-based inference. Pricing is aggressively low: Llama 3.1 8B at $0.05/$0.08 per million input/output tokens, Llama 4 Scout at $0.11/$0.34, Llama 3.3 70B at $0.59/$0.79. Limitation: inference only (no training, no fine-tuning), limited to Groq-hosted models (~10 families). Best for: speed-critical LLM inference at the lowest per-token cost.
Together AI covers the widest open-source model selection with 200+ models, including Llama, Qwen, DeepSeek, and Mixtral variants. Pricing ranges from $0.06–$3.50 per million tokens. Their Batch API offers a 50% discount. Fine-tuning supported (LoRA and full). All endpoints are OpenAI-compatible, making provider switching trivial. Fireworks AI is a similar competitor with proprietary FireAttention engine claiming 4x lower latency than vLLM.
Hugging Face Inference Endpoints let you deploy any of 60,000+ models from the Hub on dedicated infrastructure. GPU pricing: T4 ~$0.50/hr, A10G ~$1.00/hr, A100 ~$4.50/hr, H100 ~$8.00/hr. Scale-to-zero supported. Note that TGI (Text Generation Inference) entered maintenance mode in December 2025 - HF now recommends vLLM or SGLang for new deployments.
| Scenario | Best platform | Why |
|---|---|---|
| Add AI to existing app, lightweight | Cloudflare Workers AI | Edge-distributed, free tier, zero infra |
| LLM API, speed critical | Groq | 5–15x faster, cheapest per-token |
| LLM API, model variety needed | Together AI | 200+ models, OpenAI-compatible |
| Serverless GPU, Python team | Modal | Best DX, per-second billing, $30 free/month |
| Cheapest raw GPU compute | RunPod | Lowest hourly rates, no egress fees |
| Run popular models instantly | Replicate | Always-warm official models, vast library |
| Deploy HuggingFace models | HF Inference Endpoints | Seamless Hub integration |
| Batch LLM processing | Together AI (50% off) or Fireworks (40% off) | Steep batch discounts |
GCP Cloud Run supports GPU, and it's been GA since June 2025. NVIDIA L4 GPUs (24GB) run at ~$0.67/hr with per-second billing, scale-to-zero, and ~5-second cold starts. NVIDIA RTX PRO 6000 Blackwell (96GB, supporting 70B+ models) is in preview. This is the serverless GPU container solution AWS hasn't shipped. For a startup that needs GPU inference with scale-to-zero economics and doesn't want to manage EC2 capacity providers, Cloud Run GPU is the most direct answer.
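To see when scale-to-zero pays off, the sketch below compares per-second GPU billing against an always-on instance. It uses the L4 rate above and the g4dn.xlarge rate quoted earlier in this guide; it counts GPU charges only (Cloud Run also bills CPU and memory), the two GPUs are not equivalent hardware, and the function names are illustrative.

```typescript
const HOURS_PER_MONTH = 730;

// Monthly cost when you pay only for hours the GPU is actually serving requests.
function scaleToZeroMonthlyCost(
  hourlyRate: number,
  activeHoursPerDay: number,
  days: number = 30,
): number {
  return hourlyRate * activeHoursPerDay * days;
}

// Monthly cost of an instance that never shuts down.
function alwaysOnMonthlyCost(hourlyRate: number): number {
  return hourlyRate * HOURS_PER_MONTH;
}

// Active hours per day above which the always-on instance becomes cheaper.
function breakEvenActiveHoursPerDay(
  serverlessHourlyRate: number,
  alwaysOnHourlyRate: number,
  days: number = 30,
): number {
  return alwaysOnMonthlyCost(alwaysOnHourlyRate) / (serverlessHourlyRate * days);
}
```

With two active hours a day, an L4 at $0.67/hr costs about $40/month versus roughly $387/month for an always-on g4dn.xlarge at $0.53/hr. Under these assumptions the always-on instance only wins once the GPU is busy around 19 hours a day.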
Beyond Cloud Run, GCP offers Vertex AI (more unified and easier to learn than SageMaker), TPUs (unique to GCP, best for large Transformer training in JAX/TensorFlow), and tight BigQuery integration for data-heavy ML workflows. Gemini 2.5 Pro at $1.25/$10 per million tokens is competitive with Claude and GPT-5.
Azure's strongest ML play is Azure OpenAI Service: exclusive managed access to GPT-4, GPT-4o, GPT-5 family, DALL-E, and Whisper with enterprise security, VNet integration, and RBAC. No equivalent exists on AWS or GCP. Azure Container Apps now supports serverless GPUs (T4 and A100) in public preview. For enterprises already in the Microsoft ecosystem, Azure ML offers strong governance, Visual Designer, and hybrid edge deployments via Azure Arc.
AWS leads on: breadth of services (200+ integrations), SageMaker maturity (since 2017, the most feature-complete ML platform), Bedrock's multi-provider model access, custom silicon (Inferentia/Trainium), and most importantly - your existing infrastructure is here. Moving data layers is expensive and risky.
AWS's key gaps: no GPU Fargate (the biggest pain point vs Cloud Run GPU), no native OpenAI model access (must use Bedrock alternatives or call OpenAI directly), steeper SageMaker learning curve vs Vertex AI, and SageMaker Serverless Inference is CPU-only.
Stay on AWS for your data layer, core infrastructure, and most ML services. Reach outside for specific capabilities:
Abstract your LLM calls behind a router (LiteLLM, Cloudflare AI Gateway) so you can switch providers without code changes.
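If you want to see what such an abstraction boils down to before adopting LiteLLM or AI Gateway, here is a hand-rolled stand-in (not either product's API). It relies on Groq and Together AI both exposing OpenAI-compatible `/chat/completions` endpoints; the base URLs reflect those providers' documented endpoints at the time of writing and should be verified, and the type and function names are illustrative.

```typescript
type Provider = "groq" | "together";

interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface HttpRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Assumption: OpenAI-compatible base URLs; check each provider's docs.
const BASE_URLS: Record<Provider, string> = {
  groq: "https://api.groq.com/openai/v1",
  together: "https://api.together.xyz/v1",
};

// Build the same chat request for any provider; only the base URL and key differ.
function buildChatRequest(
  provider: Provider,
  model: string,
  messages: ChatMessage[],
  apiKey: string,
): HttpRequest {
  return {
    url: `${BASE_URLS[provider]}/chat/completions`,
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ model, messages }),
  };
}

// Usage: const { url, ...init } = buildChatRequest(...); await fetch(url, init);
```

Switching providers then means changing one argument and the model identifier, which is the property a dedicated router gives you along with retries, fallbacks, and spend tracking.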
Just call an API (Bedrock, OpenAI, Anthropic direct, Together AI): This is the right starting point for 90% of startup LLM use cases. Zero ops overhead. Focus on product-market fit, not infrastructure. Use when: chat, summarization, code generation, classification, embeddings. Cost: $50–500/month for moderate usage.
Serverless CPU inference (Lambda + ONNX, SageMaker Serverless, Workers AI): For sporadic traffic with acceptable cold starts. Lambda + ONNX can run scikit-learn/XGBoost models for $3–25/month. SageMaker Serverless handles larger models up to 6 GB for ~$40/month at 10M invocations. Workers AI gives you GPU inference on pre-deployed models with a 10K Neurons/day free tier. Best when: traffic is spiky, cost is paramount, latency tolerance is 1–5 seconds.
Batch processing (AWS Batch + Spot GPU, SageMaker Batch Transform): For training runs, bulk inference on datasets, feature engineering pipelines, and hyperparameter sweeps. Pay only during execution, zero idle cost. Batch + Spot GPU saves 60–90% vs on-demand. Best when: latency tolerance is hours, workloads are fault-tolerant with checkpointing.
Always-on inference (ECS on EC2 with GPU, SageMaker Real-Time Endpoints): For consistent, high-volume traffic requiring sub-100ms latency. SageMaker Real-Time now supports scale-to-zero (since Nov 2024), narrowing the gap with serverless. An ml.g5.xlarge endpoint costs ~$1,030/month always-on but handles thousands of requests per second. Best when: traffic exceeds ~800K requests/month, latency SLA below 100ms.
Managed MLOps (SageMaker end-to-end): When you have 3+ ML engineers, multiple models in production, and need experiment tracking, model registry, pipeline orchestration, and monitoring as an integrated system. Before that scale, Docker + ECS + CloudWatch + GitHub Actions is a perfectly valid ML platform.
| Monthly requests | Recommended approach | Approximate cost |
|---|---|---|
| < 1,000 | Lambda or API calls | < $10/month |
| 1K–100K | SageMaker Serverless or Lambda + ONNX | $10–100/month |
| 100K–1M | ECS/Fargate or SageMaker Serverless | $50–500/month |
| 1M–10M | ECS with auto-scaling or SageMaker Real-Time | $200–2,000/month |
| 10M+ | SageMaker Real-Time with auto-scaling or self-hosted | $1,000+/month |
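The table above translates directly into a lookup. This sketch treats each range as half-open (for example, exactly 100,000 requests falls into the 100K–1M tier), which is an interpretation since the table does not specify boundary behaviour; the names are illustrative.

```typescript
interface Recommendation {
  approach: string;
  approximateCost: string;
}

// Each tier applies to monthly request counts strictly below `below`.
const TIERS: { below: number; rec: Recommendation }[] = [
  { below: 1_000, rec: { approach: "Lambda or API calls", approximateCost: "< $10/month" } },
  { below: 100_000, rec: { approach: "SageMaker Serverless or Lambda + ONNX", approximateCost: "$10-100/month" } },
  { below: 1_000_000, rec: { approach: "ECS/Fargate or SageMaker Serverless", approximateCost: "$50-500/month" } },
  { below: 10_000_000, rec: { approach: "ECS with auto-scaling or SageMaker Real-Time", approximateCost: "$200-2,000/month" } },
];

const TOP_TIER: Recommendation = {
  approach: "SageMaker Real-Time with auto-scaling or self-hosted",
  approximateCost: "$1,000+/month",
};

// Map a monthly request volume to the table's recommended approach.
function recommendDeployment(monthlyRequests: number): Recommendation {
  for (const tier of TIERS) {
    if (monthlyRequests < tier.below) {
      return tier.rec;
    }
  }
  return TOP_TIER; // 10M+ requests per month
}
```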
The foundation model market has compressed dramatically. Here are the models that matter most for startups as of early 2026:
| Provider | Model | Input/MTok | Output/MTok | Best for |
|---|---|---|---|---|
| Anthropic | Claude Opus 4.6 | $5.00 | $25.00 | Complex reasoning, agents, coding |
| Anthropic | Claude Sonnet 4.5 | $3.00 | $15.00 | Balanced quality/cost |
| Anthropic | Claude Haiku 4.5 | $1.00 | $5.00 | Fast, cost-optimized |
| OpenAI | GPT-5.2 | $1.75 | $14.00 | General purpose flagship |
| OpenAI | GPT-5 Mini | $0.25 | $2.00 | Budget, high-volume |
| OpenAI | GPT-5 Nano | $0.05 | $0.40 | Cheapest GPT |
| Google | Gemini 2.5 Pro | $1.25 | $10.00 | Long context (1M), reasoning |
| Google | Gemini 2.0 Flash | $0.15 | $0.60 | Ultra-cheap, high-volume |
| Meta (via Bedrock) | Llama 3.3 70B | $0.72 | $0.72 | Open-source, fine-tunable |
| AWS | Amazon Nova 2 Pro | Varies | Varies | Native AWS, 1M context |
Both Anthropic and OpenAI offer prompt caching (90% savings on cached input tokens) and batch processing (50% discount). These features alone can cut LLM costs by 50–90% for many workloads.
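A back-of-the-envelope model of those two discounts is sketched below. It is a simplification: it assumes cached input tokens are billed at 10% of the list price, that the batch discount stacks on top, and it ignores any surcharge for writing to the cache. Names are illustrative, and real provider billing rules should be checked.

```typescript
interface LlmUsage {
  inputMTok: number; // millions of input tokens per month
  outputMTok: number; // millions of output tokens per month
  cachedInputFraction: number; // share of input tokens served from cache, 0..1
  batch: boolean; // whether the workload runs through the batch API
}

// Estimated monthly spend given list prices per million tokens.
function monthlyLlmCost(
  usage: LlmUsage,
  inputPricePerMTok: number,
  outputPricePerMTok: number,
): number {
  const cachedInput = usage.inputMTok * usage.cachedInputFraction;
  const freshInput = usage.inputMTok - cachedInput;
  let cost =
    freshInput * inputPricePerMTok +
    cachedInput * inputPricePerMTok * 0.1 + // assumption: 90% off cached input
    usage.outputMTok * outputPricePerMTok;
  if (usage.batch) {
    cost *= 0.5; // assumption: 50% batch discount stacks with caching
  }
  return cost;
}
```

At Claude Sonnet 4.5 list prices ($3/$15) with 100M input and 10M output tokens a month, the baseline is $450. With 80% of input cached it drops to about $234, and to about $117 if the work can also go through batch.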
Use Bedrock when you want unified billing, VPC-level security, IAM integration, built-in RAG (Knowledge Bases), Agents, Guardrails, and the ability to switch between models through a unified API without infrastructure changes. Data never leaves AWS. Batch inference at 50% discount via the Flex tier.
Use the direct Anthropic/OpenAI API when you need the absolute latest model versions (Bedrock lags slightly), provider-specific features (Anthropic's extended thinking, OpenAI's realtime API), or simpler integration for a single provider.
Self-hosting LLMs (via vLLM, Ollama, or SGLang) makes financial sense only at high volume. The breakeven is roughly 30M+ tokens/day - below that, API costs are lower because you avoid paying for idle GPU capacity. A single A100-80GB on AWS costs ~$2,000+/month regardless of utilization.
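The breakeven depends heavily on which API price you compare against, so it is worth computing for your own case. The sketch below assumes a fixed monthly GPU cost, a blended API price per million tokens, and that the GPU can actually sustain the resulting throughput; the function name is illustrative.

```typescript
// Daily token volume at which API spend equals a fixed monthly GPU bill.
function breakEvenTokensPerDay(
  gpuMonthlyUsd: number,
  apiPricePerMTokUsd: number,
  days: number = 30,
): number {
  const millionTokensPerMonth = gpuMonthlyUsd / apiPricePerMTokUsd;
  return (millionTokensPerMonth * 1_000_000) / days;
}
```

With a $2,000/month GPU, the ~30M tokens/day figure corresponds to a blended API price of roughly $2.20 per million tokens. Against a cheaper hosted model at $0.72 per million tokens, the breakeven moves out to around 93M tokens/day, which is why inexpensive open-model APIs make self-hosting harder to justify.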
vLLM is the production standard for self-hosted LLM serving: 793 tokens/sec peak throughput, PagedAttention for 40%+ memory savings, OpenAI-compatible API, distributed inference across GPUs. Ollama is best for local development only (caps at ~4 parallel requests). TGI entered maintenance mode in December 2025 - new projects should use vLLM or SGLang.
For most startups: just call an API. Self-host only when you hit $10K+/month in API spend with >70% GPU utilization potential, have strict compliance requirements (air-gapped environments), or need custom model modifications that API providers don't support.
Your Aurora Serverless v2 supports the pgvector extension, and SST's sst.aws.Vector component wraps it with a clean SDK. For RAG applications with under 10 million vectors, pgvector matches or beats dedicated vector databases on performance. With pgvectorscale, benchmarks show 471 QPS at 99% recall on 50M vectors with p95 latency of 28ms. HNSW indexing and iterative scan (pgvector 0.8.0) delivered a 5.7x query performance improvement. You get ACID transactions, combined relational + vector queries, and zero additional infrastructure cost. Only evaluate Pinecone or Qdrant Cloud if you outgrow 10M vectors or need auto-scaling for extremely spiky traffic.
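For intuition about what a vector query does, here is a brute-force, in-memory equivalent: filter by metadata, rank by cosine similarity, return the top k. This is not the sst.aws.Vector SDK or pgvector itself (which use indexes such as HNSW instead of scanning every row); all names are illustrative, and vectors are assumed to be non-zero and of equal length.

```typescript
interface StoredVector {
  id: string;
  vector: number[];
  metadata: Record<string, string>;
}

// Cosine similarity of two equal-length, non-zero vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((x, i) => {
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  });
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the k most similar rows whose metadata matches every `include` entry.
function queryTopK(
  store: StoredVector[],
  query: number[],
  k: number,
  include: Record<string, string> = {},
): { id: string; score: number }[] {
  return store
    .filter((row) => Object.keys(include).every((key) => row.metadata[key] === include[key]))
    .map((row) => ({ id: row.id, score: cosineSimilarity(query, row.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

The combined relational and vector queries mentioned above are the database-side version of the `filter` step here: the metadata predicate and the similarity ranking run in one query.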
Start with MLflow (self-hosted on Fargate) + CloudWatch. MLflow is free, open-source, and provides experiment tracking, model packaging, and a model registry. Host the tracking server on a small Fargate task backed by your Aurora database and S3 for artifacts. CloudWatch handles infrastructure metrics for SageMaker endpoints and ECS services. Add Arize Phoenix (free, open-source) or Evidently AI (free, open-source) for production model monitoring when you have models deployed. Only consider W&B Pro ($50/user/month) or Arize AX Pro ($50/month) when you have 5+ models in production.
Use DVC (Data Version Control) for Git-native data and model versioning - it stores lightweight pointer files in Git while actual artifacts live in S3. Combined with MLflow Model Registry's staging workflow (None → Staging → Production → Archived), this gives you reproducible experiments and controlled promotion without SageMaker-level complexity. For A/B testing, SageMaker supports Production Variants with traffic splitting and shadow deployments (new model receives real traffic but responses aren't returned to users). Canary deployments with automatic rollback via CloudWatch alarms are also built in.
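Traffic splitting across Production Variants is driven by relative weights: each variant receives its weight divided by the sum of all weights. The sketch below computes those shares; the interface mirrors the `variantName` and `initialVariantWeight` fields of a production variant, while the helper name and the variant names in the usage note are illustrative.

```typescript
interface Variant {
  variantName: string;
  initialVariantWeight: number;
}

// Fraction of traffic each variant receives, keyed by variant name.
function trafficShares(variants: Variant[]): Record<string, number> {
  const total = variants.reduce((sum, v) => sum + v.initialVariantWeight, 0);
  const shares: Record<string, number> = {};
  for (const v of variants) {
    shares[v.variantName] = v.initialVariantWeight / total;
  }
  return shares;
}
```

Weights of 9 for a `Current` variant and 1 for a `Candidate` variant send 90% and 10% of requests respectively, which is a common starting point for an A/B test before promoting the candidate.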
SPOT_CAPACITY_OPTIMIZED allocation strategy.

For a startup, this is enough:
One team reported running this entire pipeline for < $820/month. The full cycle from push to staged deployment takes ~17–22 minutes.
Deploy SageMaker and Bedrock resources in VPC subnets with network isolation. Use VPC Interface Endpoints (PrivateLink) for SageMaker Runtime and Bedrock to keep ML traffic off the public internet. Scope IAM policies with least-privilege - bedrock:InvokeModel restricted to specific model ARNs, separate execution roles for training vs inference. Enable KMS encryption on all S3 model artifact buckets. If targeting healthcare or finance, plan security architecture from day 1 - retrofitting HIPAA/SOC2 compliance is expensive and painful.
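A least-privilege invoke policy of the kind described above can be sketched as a plain policy document. The helper name is illustrative, and the region and model ID in the usage note are placeholders; substitute the models your application actually calls.

```typescript
interface PolicyStatement {
  Effect: "Allow" | "Deny";
  Action: string[];
  Resource: string[];
}

interface PolicyDocument {
  Version: "2012-10-17";
  Statement: PolicyStatement[];
}

// Build an IAM policy that allows invoking only the listed foundation models.
function invokeModelPolicy(region: string, modelIds: string[]): PolicyDocument {
  return {
    Version: "2012-10-17",
    Statement: [
      {
        Effect: "Allow",
        Action: ["bedrock:InvokeModel", "bedrock:InvokeModelWithResponseStream"],
        // Foundation-model ARNs have an empty account segment.
        Resource: modelIds.map((id) => `arn:aws:bedrock:${region}::foundation-model/${id}`),
      },
    ],
  };
}
```

Calling `invokeModelPolicy("us-east-1", ["anthropic.claude-3-haiku-20240307-v1:0"])` and attaching the result (serialized with `JSON.stringify`) to the inference role keeps that role from invoking any other model, and a separate role can carry training permissions.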
The ML deployment landscape for SST v4 builders breaks into a clear decision hierarchy. For LLM features, start by calling an API - Bedrock for AWS-native integration, or Anthropic/OpenAI/Together AI directly for latest models and lowest latency. Your existing Aurora Serverless v2 with pgvector is a production-ready vector store for RAG. For lightweight inference, Workers AI (pre-deployed open-source models) and Lambda (small CPU-bound custom models) run at near-zero cost. When you need GPU, don't fight AWS's Fargate/Lambda limitations - reach for Modal or GCP Cloud Run GPU for serverless GPU containers, or AWS Batch with Spot instances for training jobs at 60–90% savings.
The insight most builders miss: the operational complexity gap between "call an API" and "manage GPU infrastructure" is enormous. A solo developer calling Claude via Bedrock can ship an AI feature in a day. That same developer provisioning SageMaker endpoints with auto-scaling, model monitoring, and CI/CD needs weeks. Exhaust the API-first approach before reaching for infrastructure. When you do need infrastructure, Pulumi inside your sst.config.ts can provision AWS Batch, SageMaker, and Bedrock resources without leaving your deployment stack - and external platforms like Modal and Groq fill the gaps where AWS falls short. The goal isn't to build an MLOps platform; it's to ship ML features that your users value.