Vertex AI Pocket Book — Uplatz
50 in-depth cards • Wide layout • Readable examples • Interview Q&A included
1) What is Vertex AI?
Vertex AI is Google Cloud’s end-to-end ML/GenAI platform: data prep, training, tuning, deployment, vector search, pipelines, monitoring, and access to Google foundation models (e.g., Gemini) via a unified API. It integrates with BigQuery, Cloud Storage, Dataflow, Pub/Sub, and GKE. You can bring your own models, fine-tune foundation models, or use AutoML for tabular, vision, and text tasks. Security integrates with IAM, VPC-SC, CMEK, and audit logs.
pip install google-cloud-aiplatform
gcloud auth application-default login
2) Workbench & Notebooks
Vertex AI Workbench gives managed Jupyter-based notebooks with GCP integration (BigQuery, GCS), idle-shutdown, and one-click GPU/TPU switching. Attach service accounts with least privilege and place in private subnets if required. Use scheduled notebooks for ETL/feature jobs when Pipelines isn’t necessary.
from google.cloud import bigquery
bq = bigquery.Client()
bq.query("SELECT COUNT(*) FROM `project.dataset.table`").result()
3) Model Garden & Gemini
Model Garden exposes foundation models (Gemini family), third-party, and Google-hosted OSS models under a consistent API. You can prompt, tune, and deploy with safety filters and usage controls. Choose Gemini variants for cost/latency/quality tradeoffs, and enable caching or streaming outputs as needed.
from google.cloud import aiplatform
from vertexai.preview.generative_models import GenerativeModel
aiplatform.init(project="YOUR_PROJECT", location="us-central1")
model = GenerativeModel("gemini-1.5-pro")
print(model.generate_content("Summarize this doc: ...").text)
4) Endpoints & Deployments
Deploy your custom models (SavedModel, PyTorch, XGBoost, scikit-learn) to endpoints with autoscaling and traffic splitting. Use A/B testing, canary rollouts, and model monitoring. Configure minimum/maximum replicas, health checks, and request/response logging; attach GPUs for deep learning inference.
from google.cloud import aiplatform as aip
aip.init(project="YOUR_PROJECT", location="us-central1")
endpoint = aip.Endpoint.create(display_name="churn-endpoint")
model = aip.Model.upload(display_name="churn-xgb", artifact_uri="gs://bucket/model/",
                         serving_container_image_uri="us-docker.pkg.dev/vertex-ai/prediction/xgboost-cpu.1-7:latest")  # prebuilt serving image; check current versions
model.deploy(endpoint=endpoint, machine_type="n1-standard-4")
5) Datasets & BigQuery
Datasets can live in BigQuery tables or GCS. For tabular ML, BigQuery ML or Vertex AutoML tabular workflows work well; for GenAI RAG, store documents in GCS/BigQuery and index them in Vertex AI Vector Search. Keep data residency, encryption, and lineage documented (Data Catalog).
# Load from BigQuery in Python
import pandas_gbq
df = pandas_gbq.read_gbq("SELECT * FROM `proj.ds.customers` LIMIT 1000")
6) AutoML: Vision, Text, Tabular
AutoML trains models from labeled data without heavy ML engineering. Provide labeled datasets and Vertex selects architectures and hyperparameters. Evaluate with built-in metrics; export confusion matrices and feature importance for tabular. Use batch prediction or deploy to endpoints.
# SDK concept (Python): AutoML tabular training from a managed dataset
job = aiplatform.AutoMLTabularTrainingJob(display_name="churn-automl", optimization_prediction_type="classification")
model = job.run(dataset=tabular_ds, target_column="churned")
7) Custom Training with GPUs/TPUs
Submit custom jobs using containers or prebuilt frameworks. Scale on multiple accelerators; log metrics to Cloud Logging/Monitoring. Package code with requirements and stage on GCS. Use Vertex AI Training or GKE for large-scale distributed training with TF/XLA or PyTorch DDP.
from google.cloud import aiplatform as aip
aip.init(project="YOUR_PROJECT", location="us-central1", staging_bucket="gs://bucket/staging")
job = aip.CustomJob(
    display_name="trainer",
    worker_pool_specs=[{
        "machine_spec": {"machine_type": "a2-highgpu-1g", "accelerator_type": "NVIDIA_TESLA_A100", "accelerator_count": 1},
        "replica_count": 1,
        "container_spec": {"image_uri": "gcr.io/your/trainer:latest", "args": ["--epochs", "5"]},
    }],
)
job.run()
8) Pipelines (KFP)
Vertex AI Pipelines (Kubeflow Pipelines on GCP) orchestrates reproducible ML workflows with lineage and caching. Define components, compile to a pipeline spec, and schedule. Artifacts, metadata, and metrics are tracked for compliance/audit.
from kfp import dsl, compiler

@dsl.component
def add(a: int, b: int) -> int:
    return a + b

@dsl.pipeline(name="demo")
def p():
    add(a=2, b=3)

compiler.Compiler().compile(pipeline_func=p, package_path="pipeline.json")
# Submit: aiplatform.PipelineJob(display_name="demo", template_path="pipeline.json").run()
9) Feature Store & Feast
Use Vertex AI Feature Store (v2 integrates with BigQuery) to manage offline/online features with consistency and low-latency serving. Define feature specs, ingest from BQ/Batch jobs, and serve to models via online stores. For OSS patterns, Feast can pair with Vertex components.
# Concept: define features then ingest from BigQuery scheduled queries
10) Q&A — “AutoML vs Custom Training?”
Answer: Use AutoML for quick, strong baselines and when you lack deep ML expertise. Choose Custom Training if you need full control over architectures, libraries, distributed strategies, or specialized loss functions. Many teams start with AutoML, then move to Custom for the last-mile gains.
11) Gemini Text & Multimodal
Gemini models (text, code, multimodal) power summarization, classification, content generation, tool use, and more. Use streaming for low-latency chat UIs and function calling for tool-augmented agents. Configure safety settings (harassment, hate, etc.) per use case.
from vertexai.preview.generative_models import GenerativeModel
model = GenerativeModel("gemini-1.5-flash")
resp = model.generate_content(["Explain in 3 bullet points:", "Vertex AI Pipelines"])
print(resp.text)
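Streaming returns partial chunks as they are generated, which keeps chat UIs responsive. A minimal sketch using the SDK's stream flag (the model name and prompt are placeholders):
from vertexai.preview.generative_models import GenerativeModel
model = GenerativeModel("gemini-1.5-flash")
# stream=True yields partial chunks as they arrive, useful for typing-effect UIs
for chunk in model.generate_content("Draft a short product update.", stream=True):
    print(chunk.text, end="", flush=True)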
12) Tuning (Adapters/LoRA)
Parameter-efficient tuning lets you specialize foundation models with small domain datasets. Provide prompt-completion pairs or instruction datasets; evaluate with held-out metrics and human review. Adapters are applied at inference time, preserving base weights.
# Pseudocode: submit a tuning job referencing a GCS dataset of JSONL prompts
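A minimal sketch, assuming the vertexai supervised tuning module (vertexai.tuning.sft); the base model name and dataset path are placeholders, and names may differ across SDK versions:
import vertexai
from vertexai.tuning import sft

vertexai.init(project="YOUR_PROJECT", location="us-central1")
job = sft.train(
    source_model="gemini-1.5-flash-002",             # base model to adapt
    train_dataset="gs://bucket/tuning/train.jsonl",  # JSONL prompt/response pairs
)
# When the job completes, job.tuned_model_endpoint_name serves the adapter-augmented model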
13) Embeddings & Similarity
Use embeddings to transform text/images into vectors for semantic search, clustering, and retrieval augmentation. Store vectors in Vertex AI Vector Search (managed ANN) or BigQuery vector search for analytics integration. Choose dimensionality and distance metric appropriately.
from vertexai.language_models import TextEmbeddingModel
emb = TextEmbeddingModel.from_pretrained("textembedding-gecko@001")
vec = emb.get_embeddings(["Retrieval Augmented Generation"])[0].values
14) Vector Search
Vertex AI Vector Search offers low-latency approximate nearest neighbor search with filtering. Create indexes, upsert vectors, and query at runtime. Use for RAG, recommendations, and deduplication. Keep metadata (doc_id, source) for attribution and traceability.
# Concept: create index, upsert embeddings, then query with a vector and filter
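A condensed sketch with the aiplatform MatchingEngine classes that back Vector Search; bucket paths, dimensions, and IDs are placeholders, and index/endpoint creation are long-running operations:
from google.cloud import aiplatform

aiplatform.init(project="YOUR_PROJECT", location="us-central1")
index = aiplatform.MatchingEngineIndex.create_tree_ah_index(
    display_name="docs-index",
    contents_delta_uri="gs://bucket/embeddings/",  # JSONL of {"id": ..., "embedding": [...]}
    dimensions=768,
    approximate_neighbors_count=50,
)
endpoint = aiplatform.MatchingEngineIndexEndpoint.create(display_name="docs-index-ep", public_endpoint_enabled=True)
endpoint.deploy_index(index=index, deployed_index_id="docs_v1")
# Query with an embedding vector; returns neighbor IDs and distances
matches = endpoint.find_neighbors(deployed_index_id="docs_v1", queries=[[0.1] * 768], num_neighbors=5)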
15) RAG with Vertex
Implement RAG by chunking documents from GCS/BigQuery, generating embeddings, storing in Vector Search, and retrieving top-k contexts to ground prompts. Add citation links, metadata filters, and freshness signals. Cache retrievals to reduce cost/latency.
# Pseudocode: chunks -> embed -> upsert -> query -> prompt(model, context)
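A compact end-to-end sketch; retrieve_top_k() is a stub standing in for your Vector Search query, and the model names are placeholders:
from vertexai.language_models import TextEmbeddingModel
from vertexai.preview.generative_models import GenerativeModel

emb_model = TextEmbeddingModel.from_pretrained("textembedding-gecko@001")
llm = GenerativeModel("gemini-1.5-pro")

def retrieve_top_k(vec, k=5):
    # stub: query your Vector Search index here and return the top-k chunk texts
    return ["<chunk 1>", "<chunk 2>"]

def answer(question):
    q_vec = emb_model.get_embeddings([question])[0].values
    context = "\n".join(retrieve_top_k(q_vec))
    prompt = f"Answer using only this context, with citations.\n\nContext:\n{context}\n\nQuestion: {question}"
    return llm.generate_content(prompt).text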
16) Tool Use & Function Calling
Define tools (functions with JSON schemas). The model can propose a tool call; your app executes it and returns results to the model for final completion. Useful for database lookup, web calls, and transactional flows with human-in-the-loop safeguards.
# Concept: tools=[{"name":"getWeather","schema":{...}}] passed to generate_content()
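A minimal sketch with the SDK's FunctionDeclaration/Tool types; the getWeather function and its city parameter are illustrative only:
from vertexai.preview.generative_models import GenerativeModel, FunctionDeclaration, Tool

get_weather = FunctionDeclaration(
    name="getWeather",
    description="Look up current weather for a city",
    parameters={"type": "object", "properties": {"city": {"type": "string"}}},
)
model = GenerativeModel("gemini-1.5-pro", tools=[Tool(function_declarations=[get_weather])])
resp = model.generate_content("What's the weather in Zurich?")
call = resp.candidates[0].content.parts[0].function_call  # the proposed tool call
print(call.name, dict(call.args))
# Execute the function in your app, then send the result back to the model for the final answer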
17) Safety, Data Governance & Grounding
Use safety filters, blocklists, PII redaction, and prompt hardening. Disable data logging if required, and configure CMEK/VPC-SC. For factual tasks, ground outputs via RAG and include citations. Add evals and human review queues for sensitive domains.
# Configure safety: parameters in generate_content(), server-side policies via console
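A per-request safety-settings sketch; the exact category/threshold enum names can vary by SDK version, so treat this as an assumption to verify:
from vertexai.preview.generative_models import GenerativeModel, HarmCategory, HarmBlockThreshold

model = GenerativeModel("gemini-1.5-flash")
resp = model.generate_content(
    "Summarize this support ticket ...",
    safety_settings={HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE},
)
print(resp.text)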
18) Prompt Engineering
Structure prompts with role, instructions, constraints, examples, and format expectations. Use delimiters for context, ask for JSON output with a schema, and chain prompts for complex tasks. Add system prompts for style/voice and few-shot examples to steer behavior.
prompt = """You are a concise assistant.
Constraints: bullet list, 3 items.
Topic: Vertex AI safety controls."""
19) GenAI Evaluation
Evaluate with automatic metrics (BLEU/ROUGE for text, embedding similarity) and human ratings (helpfulness, harmlessness, honesty). Use golden sets and adversarial tests. Track drift, hallucination rates, and safety policy violations over time with dashboards.
# Store evals in BigQuery for analysis & dashboards
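A small sketch along those lines: score a candidate answer against a reference by embedding similarity and log a row to BigQuery for dashboards (the table proj.evals.genai_scores is a placeholder and must exist with a matching schema):
import datetime
from google.cloud import bigquery
from vertexai.language_models import TextEmbeddingModel

emb = TextEmbeddingModel.from_pretrained("textembedding-gecko@001")

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / ((sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5))

ref, cand = "Expected answer ...", "Model answer ..."
vecs = [e.values for e in emb.get_embeddings([ref, cand])]
row = {"ts": datetime.datetime.utcnow().isoformat(), "similarity": cosine(*vecs)}
bigquery.Client().insert_rows_json("proj.evals.genai_scores", [row])  # table must already exist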
20) Q&A — “When to choose Gemini Flash vs Pro?”
Answer: Use Flash for low-latency, high-throughput tasks (UI autocomplete, quick summaries). Choose Pro for higher reasoning quality, complex instructions, or longer context. Benchmark your tasks—often a hybrid (Flash for previews, Pro for finalize) balances UX and cost.
21) CI/CD for Models
Adopt Git-based workflows with Cloud Build/GitHub Actions to build, test, and deploy models/pipelines. Store artifacts in Artifact Registry. Use environments (dev/stage/prod) with approvals and canaries. Version datasets, code, and model weights.
# Cloud Build step (concept)
gcloud ai models upload --region=us-central1 --display-name=my-model \
  --artifact-uri=gs://bucket/model --container-image-uri=SERVING_IMAGE_URI
22) Model Registry
Register models with metadata (version, metrics, lineage). Promote through stages, attach evaluation reports, and tie to endpoints. Enforce checks (schema, bias, safety) before production. Keep changelogs and rollback plans.
# Track versions and link to PipelineJob runs & datasets
23) Data & Model Lineage
Vertex ML Metadata records artifacts, executions, and contexts. This enables audits and reproducibility (“this model came from dataset X via pipeline Y”). Integrate with Data Catalog for dataset governance.
# Access lineage via Vertex console or Metadata APIs
24) Model Monitoring
Monitor prediction drift, data skew, and performance. Configure alerting via Cloud Monitoring. Capture request/response samples, compute feature statistics, and trigger re-training when drift exceeds thresholds. For GenAI, log prompt/response pairs for safety and quality reviews.
# Concept: enable logging/monitoring on Endpoint; export to BigQuery
25) Batch vs Online Prediction
Batch prediction is cost-efficient for large offline scoring jobs (e.g., nightly segments). Online endpoints serve latency-sensitive requests. Often both coexist: batch for bulk updates, online for personalization at request time. Keep model code identical across modes.
# Batch predict
model.batch_predict(job_display_name="score", instances_format="jsonl", gcs_source=..., gcs_destination_prefix=...)
26) Cost Controls
Right-size machines/accelerators, enable autoscaling, and set quotas/budgets. Cache embeddings, reuse vector indexes, and stream responses. For pipelines, enable caching and shut down idle resources. Track per-project costs in BigQuery billing exports.
# Budgets & alerts in Cloud Billing; labels for cost allocation
27) Testing ML Systems
Unit test data transformations; integration test pipelines; canary test endpoints. Maintain golden datasets and backtesting harnesses. For GenAI, build red-team suites and toxicity/factuality tests. Automate in CI.
# pytest + sample JSONL prompts; assert structure & policy scores
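A pytest sketch, assuming Gemini access in the test project: it asserts the response parses as JSON with the required keys (prompt, key names, and model are placeholders):
import json
import pytest
from vertexai.preview.generative_models import GenerativeModel

@pytest.fixture(scope="module")
def model():
    return GenerativeModel("gemini-1.5-flash")

def test_json_structure(model):
    resp = model.generate_content(
        'Return JSON with keys "summary" and "risk_level" for: password reset flow.',
        generation_config={"response_mime_type": "application/json"},  # request raw JSON output
    )
    data = json.loads(resp.text)  # fails the test if the output is not valid JSON
    assert {"summary", "risk_level"} <= set(data)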
28) Governance & Responsible AI
Document model cards, data sheets, and intended use. Apply bias checks, consent/logging controls, PII handling, and safety filters. Provide user controls for opt-out and human escalation. Record decisions for audits.
# Store governance artifacts in GCS with signed URLs for reviews
29) Hybrid & Private Networking
Access on-prem/private data via Private Service Connect/VPC-SC. Place endpoints in regions near users/data. Use CMEK for encryption and restrict egress with Cloud NAT + firewall rules. For strict environments, add approval gates.
# VPC-SC perimeter with restricted services & projects
30) Q&A — “How do I trigger retraining safely?”
Answer: Monitor drift and performance thresholds; when breached, kick off a PipelineJob that validates data schema, runs training, evaluates against golden sets, compares against champion, and only promotes if it beats guardrails. Use canary deployments with shadow traffic before full cutover.
31) BigQuery ML vs Vertex AI
BQML trains models directly in SQL (linear/logistic regression, boosted trees via XGBoost, DNNs, ARIMA_PLUS time series). Great for analysts and fast iteration. Vertex AI is better for custom training, GenAI, feature stores, vector search, and full MLOps. Combine them: train in BQML, serve via Vertex endpoints if needed.
CREATE OR REPLACE MODEL ds.churn OPTIONS(MODEL_TYPE='LOGISTIC_REG') AS SELECT ...;
32) Dataflow & ETL
Use Dataflow (Apache Beam) for scalable ETL/feature pipelines. Stream from Pub/Sub to BigQuery and Feature Store; embed featurization and windowing. Keep schemas versioned and test with synthetic data.
# Python Beam skeleton reading Pub/Sub and writing to BQ
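A Beam skeleton along those lines (topic, table, and schema are placeholders); run locally with the DirectRunner or on Dataflow with --runner=DataflowRunner:
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
    (
        p
        | "Read" >> beam.io.ReadFromPubSub(topic="projects/PROJECT/topics/events")
        | "Parse" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
        | "Write" >> beam.io.WriteToBigQuery(
            "PROJECT:ds.events",
            schema="user_id:STRING,event:STRING,ts:TIMESTAMP",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )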
33) Pub/Sub for Real-time Scoring
Publish events to Pub/Sub; trigger Cloud Functions/Run that call Vertex endpoints. Add retries with dead-letter topics, idempotency, and timeout budgets. Log request IDs for traceability.
# Cloud Run handler calls endpoint.predict(payload)
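A minimal Flask handler for Cloud Run, assuming an existing endpoint (the resource name and payload shape are placeholders):
from flask import Flask, request, jsonify
from google.cloud import aiplatform

app = Flask(__name__)
aiplatform.init(project="YOUR_PROJECT", location="us-central1")
endpoint = aiplatform.Endpoint("projects/PROJECT_NUMBER/locations/us-central1/endpoints/ENDPOINT_ID")

@app.post("/score")
def score():
    payload = request.get_json()  # e.g. {"instances": [{"x": 1, "y": 2}]}
    preds = endpoint.predict(instances=payload["instances"]).predictions
    return jsonify({"predictions": preds})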
34) Cloud Run GenAI APIs
Wrap Gemini calls in a Cloud Run microservice with auth, rate limits, and caching. Stream SSE to the frontend for typing-effect UX. Keep system prompts in Config; rotate keys and apply quotas per tenant.
# Flask/FastAPI + vertexai SDK + streaming response
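A FastAPI sketch that relays Gemini chunks as server-sent events; auth, rate limits, and caching are omitted, and the model name is a placeholder:
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from vertexai.preview.generative_models import GenerativeModel

app = FastAPI()
model = GenerativeModel("gemini-1.5-flash")

@app.get("/chat")
def chat(q: str):
    def events():
        for chunk in model.generate_content(q, stream=True):
            yield f"data: {chunk.text}\n\n"  # SSE framing for a typing-effect UI
    return StreamingResponse(events(), media_type="text/event-stream")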
35) Document AI + RAG
Extract text/structure with Document AI, chunk and index in Vector Search, then build a Gemini-powered Q&A over your PDFs. Keep provenance links and confidence scores; redact PII if needed.
# Pipeline: GCS -> DocAI -> chunks -> embeddings -> index
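An extraction sketch with the Document AI client; the processor ID, location, and file path are placeholders:
from google.cloud import documentai

client = documentai.DocumentProcessorServiceClient()
name = client.processor_path("YOUR_PROJECT", "us", "PROCESSOR_ID")

with open("contract.pdf", "rb") as f:
    raw = documentai.RawDocument(content=f.read(), mime_type="application/pdf")

result = client.process_document(request=documentai.ProcessRequest(name=name, raw_document=raw))
text = result.document.text  # chunk this, embed it, and upsert into Vector Search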
36) Images & Vision APIs
Use Vertex Vision models for classification/detection or tune for your labels. For generative images, call the appropriate model endpoints (when available in your region). Store prompts/outputs for audit and safety review.
# Upload labeled images to GCS, start AutoML Vision training
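A sketch of that flow, assuming a labels CSV already staged in GCS (paths, names, and training budget are placeholders):
from google.cloud import aiplatform

aiplatform.init(project="YOUR_PROJECT", location="us-central1")
ds = aiplatform.ImageDataset.create(
    display_name="products",
    gcs_source="gs://bucket/labels.csv",  # image URIs + labels
    import_schema_uri=aiplatform.schema.dataset.ioformat.image.single_label_classification,
)
job = aiplatform.AutoMLImageTrainingJob(display_name="products-clf", prediction_type="classification")
model = job.run(dataset=ds, budget_milli_node_hours=8000)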
37) Code Assist & Agents
Build internal code assistants with Gemini Code models, function calling, and repository retrieval. Add guardrails (no secrets), explain diffs, and propose patches. For support bots, combine conversational state with RAG and action tools (ticket systems).
# Tool-enabled chat pipeline with user/session context
38) Evaluations at Scale
Run batch evals in Pipelines against curated prompts. Capture metrics (accuracy, toxicity, refusal rates), store in BigQuery, and visualize in Looker Studio. Automate regression checks on every model or prompt change.
# Pipeline step writes eval JSONL to BigQuery for dashboards
39) Multi-Region & DR
Choose regions close to data/users; replicate artifacts and indexes. Use separate projects for isolation, per-env service accounts, and org policies. Test failovers and rate-limit fallbacks in client apps.
# Artifact Registry mirrors; dual endpoints with traffic split
40) Q&A — “How to secure GenAI endpoints?”
Answer: Enforce IAM, private networking (PSC), request auth (ID tokens), quotas, and per-tenant limits. Apply safety filters, prompt wrapping, output validation (JSON schemas), and PII redaction. Log prompts/outputs with data retention controls and build abuse monitoring.
41) Quickstart: Create Endpoint & Predict (Python)
Initialize, upload, deploy, and predict; tear down when done to save cost. Ensure the service account has roles/aiplatform.user and storage access, and set the region explicitly.
from google.cloud import aiplatform as aip
aip.init(project="P", location="us-central1")
m = aip.Model.upload(display_name="clf", artifact_uri="gs://bucket/model",
                     serving_container_image_uri="SERVING_IMAGE_URI")  # prebuilt or custom serving container
ep = aip.Endpoint.create(display_name="clf-ep")
m.deploy(endpoint=ep, machine_type="n1-standard-2")
print(ep.predict(instances=[{"x": 1, "y": 2}]).predictions)
ep.undeploy_all(); ep.delete(); m.delete()  # tear down to stop charges
42) Quickstart: Gemini REST (curl)
Call the generative endpoint via REST with OAuth access token. Prefer server-to-server calls; never expose tokens in the browser. Stream when building chats.
curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://us-central1-aiplatform.googleapis.com/v1/projects/PROJECT/locations/us-central1/publishers/google/models/gemini-1.5-pro:generateContent" \
  -d '{"contents":[{"role":"user","parts":[{"text":"Explain Vertex AI Pipelines"}]}]}'
43) Prompt Template Pattern
Keep prompts in files with placeholders; inject variables server-side. Version prompts and track A/B results. For JSON outputs, validate against a schema and retry with error-aware prompts.
template = """Role: helpful assistant.
Output JSON with keys: steps[], risks[].
Topic: {topic}"""
prompt = template.format(topic="Model Monitoring")
44) RAG Chunking & Metadata
Chunk by semantic boundaries, store title/section/source_url, and use hybrid retrieval (BM25 + vectors). Re-rank candidates before prompting. Add citations in the final answer and cache results.
# Store metadata alongside vectors: {"doc_id": "...", "section": "...", "url": "..."}
45) Latency Tactics
Choose a closer region, use "flash" models when feasible, enable streaming, cache embeddings, reuse HTTP connections, and pre-warm endpoints. For pipelines, use caching and parallelism; avoid tiny batch sizes.
# HTTP keep-alive + connection pooling in your client
46) Cost Tactics
Batch non-urgent requests, cap max tokens, set per-user quotas, and use retrieval caches. Downshift model tiers for drafts and upgrade for finalization. Delete unused endpoints and indexes.
# Track cost by labels: project, team, app, environment
47) Common Pitfalls
Forgetting region in SDK calls, mixing projects/SA scopes, leaving endpoints running, no safety filters, missing retries/timeouts, and unbounded prompt sizes. Fix with client wrappers, guardrails, and budgets.
# Always set: aiplatform.init(project="...", location="...")
48) Production Checklist
IAM + network controls, logging/metrics/traces, eval gates, canary rollouts, budget alerts, data retention, incident runbooks, and continuous red-teaming.
# Cloud Monitoring alerts: p95 latency, error rate, token spend
49) Reference Patterns
1) GenAI Chat with tools + RAG. 2) Classification endpoint + batch scoring. 3) Image AutoML + online detection. 4) Recommender with embeddings + vector search. 5) Code assistant with repo retrieval.
# Start simple, measure, iterate; codify patterns as reusable microservices