Google Vertex AI Pocket Book

Vertex AI Pocket Book — Uplatz

50 in-depth cards • Wide layout • Readable examples • Interview Q&A included

Section 1 — Overview & Building Blocks

1) What is Vertex AI?

Vertex AI is Google Cloud’s end-to-end ML/GenAI platform: data prep, training, tuning, deployment, vector search, pipelines, monitoring, and access to Google foundation models (e.g., Gemini) via a unified API. It integrates with BigQuery, Cloud Storage, Dataflow, Pub/Sub, and GKE. You can bring your own models, fine-tune foundation models, or use AutoML for tabular, vision, and text tasks. Security integrates with IAM, VPC-SC, CMEK, and audit logs.

pip install google-cloud-aiplatform
gcloud auth application-default login

2) Workbench & Notebooks

Vertex AI Workbench gives managed Jupyter-based notebooks with GCP integration (BigQuery, GCS), idle-shutdown, and one-click GPU/TPU switching. Attach service accounts with least privilege and place in private subnets if required. Use scheduled notebooks for ETL/feature jobs when Pipelines isn’t necessary.

from google.cloud import bigquery
bq = bigquery.Client()
bq.query("SELECT COUNT(*) FROM `project.dataset.table`").result()

3) Model Garden & Gemini

Model Garden exposes foundation models (Gemini family), third-party, and Google-hosted OSS models under a consistent API. You can prompt, tune, and deploy with safety filters and usage controls. Choose Gemini variants for cost/latency/quality tradeoffs, and enable caching or streaming outputs as needed.

import vertexai
from vertexai.generative_models import GenerativeModel
vertexai.init(project="YOUR_PROJECT", location="us-central1")
model = GenerativeModel("gemini-1.5-pro")
print(model.generate_content("Summarize this doc: ...").text)

4) Endpoints & Deployments

Deploy your custom models (SavedModel, PyTorch, XGBoost, scikit-learn) to endpoints with autoscaling and traffic splitting. Use A/B testing, canary rollouts, and model monitoring. Configure minimum/maximum replicas, health checks, and request/response logging; attach GPUs for deep learning inference.

from google.cloud import aiplatform as aip
endpoint = aip.Endpoint.create(display_name="churn-endpoint")
model = aip.Model.upload(
    display_name="churn-xgb",
    artifact_uri="gs://bucket/model/",
    # A serving container is required for framework models; prebuilt image tag shown is illustrative
    serving_container_image_uri="us-docker.pkg.dev/vertex-ai/prediction/xgboost-cpu.1-7:latest",
)
model.deploy(endpoint=endpoint, machine_type="n1-standard-4")

5) Datasets & BigQuery

Datasets can live in BigQuery tables or GCS. For tabular ML, BigQuery ML or AutoML-style tabular workflows work well; for GenAI RAG, store documents in GCS/BigQuery and index them in Vertex AI Vector Search. Keep data residency, encryption, and lineage documented (Data Catalog).

# Load from BigQuery in Python
import pandas_gbq
df = pandas_gbq.read_gbq("SELECT * FROM `proj.ds.customers` LIMIT 1000")

6) AutoML: Vision, Text, Tabular

AutoML trains models from labeled data without heavy ML engineering. Provide labeled datasets and Vertex selects architectures and hyperparameters. Evaluate with built-in metrics; export confusion matrices and feature importance for tabular models. Use batch prediction or deploy to endpoints.

# SDK sketch (names illustrative): AutoML image classification from a managed image dataset
job = aip.AutoMLImageTrainingJob(display_name="automl-vision", prediction_type="classification")
model = job.run(dataset=image_dataset, budget_milli_node_hours=8000)

7) Custom Training with GPUs/TPUs

Submit custom jobs using containers or prebuilt frameworks. Scale on multiple accelerators; log metrics to Cloud Logging/Monitoring. Package code with requirements and stage on GCS. Use Vertex AI Training or GKE for large-scale distributed training with TF/XLA or PyTorch DDP.

# Assumes aip.init(project=..., location=..., staging_bucket="gs://bucket/staging") was called
job = aip.CustomJob(
  display_name="trainer",
  worker_pool_specs=[{
    "machine_spec": {"machine_type": "a2-highgpu-1g",
                     "accelerator_type": "NVIDIA_TESLA_A100",
                     "accelerator_count": 1},
    "replica_count": 1,
    "container_spec": {"image_uri": "gcr.io/your/trainer:latest", "args": ["--epochs", "5"]},
  }],
)
job.run()  # blocks until completion; use job.submit() to return immediately

8) Pipelines (KFP)

Vertex AI Pipelines (Kubeflow Pipelines on GCP) orchestrates reproducible ML workflows with lineage and caching. Define components, compile to a pipeline spec, and schedule. Artifacts, metadata, and metrics are tracked for compliance/audit.

from kfp import dsl, compiler

@dsl.component
def add(a: int, b: int) -> int:
    return a + b

@dsl.pipeline(name="add-pipeline")
def p():
    add(a=2, b=3)  # KFP v2 components take keyword arguments

compiler.Compiler().compile(p, "pipeline.json")
# Run it: aiplatform.PipelineJob(display_name="p", template_path="pipeline.json").submit()

9) Feature Store & Feast

Use Vertex AI Feature Store (v2 is BigQuery-native) to manage offline/online features with consistency and low-latency serving. Define feature specs, ingest from BigQuery or batch jobs, and serve features to models via the online store. For OSS patterns, Feast can pair with Vertex components.

# Concept: define features then ingest from BigQuery scheduled queries
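
A minimal sketch using the Featurestore resource classes in the Python SDK (the BigQuery-native Feature Store has a different surface); store, entity, and column names are illustrative:

from google.cloud import aiplatform as aip

fs = aip.Featurestore.create(featurestore_id="customer_fs", online_store_fixed_node_count=1)
users = fs.create_entity_type(entity_type_id="users")
users.batch_create_features(feature_configs={"age": {"value_type": "INT64"},
                                              "country": {"value_type": "STRING"}})
users.ingest_from_bq(feature_ids=["age", "country"], feature_time="update_time",
                     bq_source_uri="bq://proj.ds.users", entity_id_field="user_id")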

10) Q&A — “AutoML vs Custom Training?”

Answer: Use AutoML for quick, strong baselines and when you lack deep ML expertise. Choose Custom Training if you need full control over architectures, libraries, distributed strategies, or specialized loss functions. Many teams start with AutoML, then move to Custom for the last-mile gains.

Section 2 — Generative AI on Vertex (Gemini, Tuning, Safety, RAG)

11) Gemini Text & Multimodal

Gemini models (text, code, multimodal) power summarization, classification, content generation, tool use, and more. Use streaming for low-latency chat UIs and function calling for tool-augmented agents. Configure safety settings (harassment, hate, etc.) per use case.

from vertexai.generative_models import GenerativeModel
model = GenerativeModel("gemini-1.5-flash")
resp = model.generate_content(["Explain in 3 bullet points:", "Vertex AI Pipelines"])
print(resp.text)
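
For chat UIs, a streaming sketch that prints partial text as chunks arrive (same model object as above):

for chunk in model.generate_content("Explain Vertex AI Pipelines", stream=True):
    print(chunk.text, end="", flush=True)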

12) Tuning (Adapters/LoRA)

Parameter-efficient tuning lets you specialize foundation models with small domain datasets. Provide prompt-completion pairs or instruction datasets; evaluate with held-out metrics and human review. Adapters are applied at inference time, preserving base weights.

# Pseudocode: submit a tuning job referencing a GCS dataset of JSONL prompts
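
A hedged sketch of a supervised fine-tuning job; the tuning module path, argument names, and supported base models vary by SDK release, so treat this as illustrative:

from vertexai.tuning import sft  # exposed as vertexai.preview.tuning in older SDKs

tuning_job = sft.train(
    source_model="gemini-1.5-flash-002",             # base model name is illustrative
    train_dataset="gs://bucket/tuning/train.jsonl",  # JSONL of prompt/response pairs
)
print(tuning_job.tuned_model_endpoint_name)          # populated once the job completes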

13) Embeddings & Similarity

Use embeddings to transform text/images into vectors for semantic search, clustering, and retrieval augmentation. Store vectors in Vertex AI Vector Search (managed ANN) or BigQuery Vector for analytics integration. Choose dimensionality and distance metric appropriately.

from vertexai.language_models import TextEmbeddingModel
emb = TextEmbeddingModel.from_pretrained("textembedding-gecko@001")
vec = emb.get_embeddings(["Retrieval Augmented Generation"])[0].values

14) Vector Search

Vertex AI Vector Search offers low-latency approximate nearest neighbor search with filtering. Create indexes, upsert vectors, and query at runtime. Use for RAG, recommendations, and deduplication. Keep metadata (doc_id, source) for attribution and traceability.

# Concept: create index, upsert embeddings, then query with a vector and filter
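
A hedged sketch using the Matching Engine classes that back Vector Search in the Python SDK; bucket paths, dimensions, and IDs are illustrative, and query_vector is a previously computed embedding:

from google.cloud import aiplatform as aip

index = aip.MatchingEngineIndex.create_tree_ah_index(
    display_name="docs-index",
    contents_delta_uri="gs://bucket/embeddings/",   # JSON files of {"id": ..., "embedding": [...]}
    dimensions=768,
    approximate_neighbors_count=10,
)
index_ep = aip.MatchingEngineIndexEndpoint.create(display_name="docs-index-ep",
                                                  public_endpoint_enabled=True)
index_ep.deploy_index(index=index, deployed_index_id="docs_v1")
hits = index_ep.find_neighbors(deployed_index_id="docs_v1", queries=[query_vector], num_neighbors=5)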

15) RAG with Vertex

Implement RAG by chunking documents from GCS/BigQuery, generating embeddings, storing in Vector Search, and retrieving top-k contexts to ground prompts. Add citation links, metadata filters, and freshness signals. Cache retrievals to reduce cost/latency.

# Pseudocode: chunks -> embed -> upsert -> query -> prompt(model, context)
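
A minimal end-to-end sketch; an in-memory cosine search stands in for Vector Search so the retrieve-then-ground flow fits in a few lines (model names are illustrative):

import numpy as np
from vertexai.language_models import TextEmbeddingModel
from vertexai.generative_models import GenerativeModel

emb_model = TextEmbeddingModel.from_pretrained("textembedding-gecko@001")
chunks = ["Pipelines orchestrate ML workflows.", "Endpoints serve online predictions."]
vectors = np.array([e.values for e in emb_model.get_embeddings(chunks)])

def top_k(query, k=1):
    q = np.array(emb_model.get_embeddings([query])[0].values)
    scores = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q))
    return [chunks[i] for i in np.argsort(-scores)[:k]]

question = "What serves online predictions?"
context = "\n".join(top_k(question))
answer = GenerativeModel("gemini-1.5-flash").generate_content(
    f"Answer using only this context:\n{context}\n\nQuestion: {question}")
print(answer.text)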

16) Tool Use & Function Calling

Define tools (functions with JSON schemas). The model can propose a tool call; your app executes it and returns results to the model for final completion. Useful for database lookup, web calls, and transactional flows with human-in-the-loop safeguards.

# Concept: tools=[{"name":"getWeather","schema":{...}}] passed to generate_content()
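
A hedged sketch with the function-calling types in vertexai.generative_models; getWeather and its schema are illustrative, and your app executes the call itself:

from vertexai.generative_models import FunctionDeclaration, GenerativeModel, Part, Tool

get_weather = FunctionDeclaration(
    name="getWeather",
    description="Look up current weather for a city",
    parameters={"type": "object", "properties": {"city": {"type": "string"}}},
)
model = GenerativeModel("gemini-1.5-pro", tools=[Tool(function_declarations=[get_weather])])
chat = model.start_chat()
resp = chat.send_message("What's the weather in Paris?")
call = resp.candidates[0].content.parts[0].function_call   # proposed tool call: name + args
# Execute the tool yourself, then return the result for the final completion
final = chat.send_message(Part.from_function_response(name="getWeather",
                                                      response={"content": {"temp_c": 21}}))
print(final.text)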

17) Safety, Data Governance & Grounding

Use safety filters, blocklists, PII redaction, and prompt hardening. Disable data logging if required, and configure CMEK/VPC-SC. For factual tasks, ground outputs via RAG and include citations. Add evals and human review queues for sensitive domains.

# Configure safety: parameters in generate_content(), server-side policies via console
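
A minimal sketch of per-request safety settings (category and threshold names follow the vertexai SDK enums; tune them to your policy):

from vertexai.generative_models import GenerativeModel, HarmBlockThreshold, HarmCategory

model = GenerativeModel("gemini-1.5-flash")
resp = model.generate_content(
    "Draft a polite reply to this customer complaint: ...",
    safety_settings={
        HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_LOW_AND_ABOVE,
        HarmCategory.HARM_CATEGORY_HATE_SPEECH: HarmBlockThreshold.BLOCK_LOW_AND_ABOVE,
    },
)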

18) Prompt Engineering

Structure prompts with role, instructions, constraints, examples, and format expectations. Use delimiters for context, ask for JSON output with a schema, and chain prompts for complex tasks. Add system prompts for style/voice and few-shot examples to steer behavior.

prompt = """You are a concise assistant.
Constraints: bullet list, 3 items.
Topic: Vertex AI safety controls."""

19) GenAI Evaluation

Evaluate with automatic metrics (BLEU/ROUGE for text, embedding similarity) and human ratings (helpfulness, harmlessness, honesty). Use golden sets and adversarial tests. Track drift, hallucination rates, and safety policy violations over time with dashboards.

# Store evals in BigQuery for analysis & dashboards
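
A small sketch of one automatic metric: embedding similarity between a model answer and a reference, emitted as a row you could load into BigQuery (model name and fields are illustrative):

import numpy as np
from vertexai.language_models import TextEmbeddingModel

emb = TextEmbeddingModel.from_pretrained("textembedding-gecko@001")
answer = "Vertex AI Pipelines orchestrate ML workflows."
reference = "Pipelines orchestrate reproducible ML workflows with lineage."
a, r = (np.array(e.values) for e in emb.get_embeddings([answer, reference]))
row = {"metric": "embedding_similarity",
       "value": float(a @ r / (np.linalg.norm(a) * np.linalg.norm(r)))}
print(row)  # append rows like this to an evals table in BigQuery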

20) Q&A — “When to choose Gemini Flash vs Pro?”

Answer: Use Flash for low-latency, high-throughput tasks (UI autocomplete, quick summaries). Choose Pro for higher reasoning quality, complex instructions, or longer context. Benchmark your own tasks; a hybrid approach (Flash for drafts, Pro to finalize) often balances UX and cost.

Section 3 — MLOps: CI/CD, Monitoring, Lineage, Cost

21) CI/CD for Models

Adopt Git-based workflows with Cloud Build/GitHub Actions to build, test, and deploy models/pipelines. Store artifacts in Artifact Registry. Use environments (dev/stage/prod) with approvals and canaries. Version datasets, code, and model weights.

# Cloud Build step (concept)
gcloud ai models upload --display-name my-model --artifact-uri gs://bucket/model

22) Model Registry

Register models with metadata (version, metrics, lineage). Promote through stages, attach evaluation reports, and tie to endpoints. Enforce checks (schema, bias, safety) before production. Keep changelogs and rollback plans.

# Track versions and link to PipelineJob runs & datasets
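
A hedged sketch of registering a new version under an existing registry entry; model_v1, SERVING_IMAGE, and the URIs are illustrative:

from google.cloud import aiplatform as aip

v2 = aip.Model.upload(
    display_name="churn-xgb",
    parent_model=model_v1.resource_name,        # existing registered model
    artifact_uri="gs://bucket/model-v2/",
    serving_container_image_uri=SERVING_IMAGE,  # same prebuilt serving image as v1
    is_default_version=False,                   # promote explicitly after evaluation
)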

23) Data & Model Lineage

Vertex ML Metadata records artifacts, executions, and contexts. This enables audits and reproducibility (“this model came from dataset X via pipeline Y”). Integrate with Data Catalog for dataset governance.

# Access lineage via Vertex console or Metadata APIs

24) Model Monitoring

Monitor prediction drift, data skew, and performance. Configure alerting via Cloud Monitoring. Capture request/response samples, compute feature statistics, and trigger re-training when drift exceeds thresholds. For GenAI, log prompt/response pairs for safety and quality reviews.

# Concept: enable logging/monitoring on Endpoint; export to BigQuery

25) Batch vs Online Prediction

Batch prediction is cost-efficient for large offline scoring jobs (e.g., nightly segments). Online endpoints serve latency-sensitive requests. Often both coexist: batch for bulk updates, online for personalization at request time. Keep model code identical across modes.

# Batch predict
model.batch_predict(job_display_name="score", instances_format="jsonl",
                    gcs_source=..., gcs_destination_prefix=...)  # wraps BatchPredictionJob.create

26) Cost Controls

Right-size machines/accelerators, enable autoscaling, and set quotas/budgets. Cache embeddings, reuse vector indexes, and stream responses. For pipelines, enable caching and shut down idle resources. Track per-project costs in BigQuery billing exports.

# Budgets & alerts in Cloud Billing; labels for cost allocation

27) Testing ML Systems

Unit test data transformations; integration test pipelines; canary test endpoints. Maintain golden datasets and backtesting harnesses. For GenAI, build red-team suites and toxicity/factuality tests. Automate in CI.

# pytest + sample JSONL prompts; assert structure & policy scores
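
A pytest-style sketch; call_model is a hypothetical wrapper around your endpoint, and golden_prompts.jsonl holds curated cases:

import json

def validate_answer(raw: str) -> dict:
    data = json.loads(raw)                        # must be valid JSON
    assert {"answer", "citations"} <= set(data)   # required keys
    assert len(data["answer"]) < 2000             # bounded length
    return data

def test_golden_prompts():
    with open("golden_prompts.jsonl") as f:
        for line in f:
            case = json.loads(line)
            validate_answer(call_model(case["prompt"]))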

28) Governance & Responsible AI

Document model cards, data sheets, and intended use. Apply bias checks, consent/logging controls, PII handling, and safety filters. Provide user controls for opt-out and human escalation. Record decisions for audits.

# Store governance artifacts in GCS with signed URLs for reviews

29) Hybrid & Private Networking

Access on-prem/private data via Private Service Connect/VPC-SC. Place endpoints in regions near users/data. Use CMEK for encryption and restrict egress with Cloud NAT + firewall rules. For strict environments, add approval gates.

# VPC-SC perimeter with restricted services & projects

30) Q&A — “How do I trigger retraining safely?”

Answer: Monitor drift and performance thresholds; when breached, kick off a PipelineJob that validates data schema, runs training, evaluates against golden sets, compares against champion, and only promotes if it beats guardrails. Use canary deployments with shadow traffic before full cutover.

Section 4 — Integrations, Data Patterns, Examples

31) BigQuery ML vs Vertex AI

BQML trains models directly in SQL (linear and logistic regression, boosted trees via XGBoost, deep neural networks, ARIMA time series). Great for analysts and fast iteration. Vertex AI is better for custom training, GenAI, feature stores, vector search, and full MLOps. Combine them: train in BQML, serve via Vertex endpoints if needed.

CREATE OR REPLACE MODEL ds.churn OPTIONS(MODEL_TYPE='LOGISTIC_REG') AS SELECT ...;

32) Dataflow & ETL

Use Dataflow (Apache Beam) for scalable ETL/feature pipelines. Stream from Pub/Sub to BigQuery and Feature Store; embed featurization and windowing. Keep schemas versioned and test with synthetic data.

# Python Beam skeleton reading Pub/Sub and writing to BQ
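
A hedged Beam skeleton for the Pub/Sub-to-BigQuery path; subscription, table, and bucket names are illustrative, and the destination table is assumed to exist:

import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

opts = PipelineOptions(streaming=True, project="P", region="us-central1",
                       runner="DataflowRunner", temp_location="gs://bucket/tmp")
with beam.Pipeline(options=opts) as p:
    (p
     | "Read" >> beam.io.ReadFromPubSub(subscription="projects/P/subscriptions/events-sub")
     | "Parse" >> beam.Map(lambda b: json.loads(b.decode("utf-8")))
     | "Write" >> beam.io.WriteToBigQuery(
           "P:ds.events",
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))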

33) Pub/Sub for Real-time Scoring

Publish events to Pub/Sub; trigger Cloud Functions/Run that call Vertex endpoints. Add retries with dead-letter topics, idempotency, and timeout budgets. Log request IDs for traceability.

# Cloud Run handler calls endpoint.predict(payload)
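
A minimal Flask handler for a Pub/Sub push subscription that calls a Vertex endpoint; project, endpoint ID, and payload shape are illustrative:

import base64, json
from flask import Flask, request
from google.cloud import aiplatform as aip

app = Flask(__name__)
aip.init(project="P", location="us-central1")
endpoint = aip.Endpoint("projects/P/locations/us-central1/endpoints/1234567890")

@app.route("/", methods=["POST"])
def handle():
    msg = request.get_json()["message"]                 # Pub/Sub push envelope
    event = json.loads(base64.b64decode(msg["data"]))   # original published event
    pred = endpoint.predict(instances=[event["features"]])
    print(f"messageId={msg.get('messageId')} prediction={pred.predictions[0]}")
    return ("", 204)                                    # ack; non-2xx triggers redelivery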

34) Cloud Run GenAI APIs

Wrap Gemini calls in a Cloud Run microservice with auth, rate limits, and caching. Stream SSE to the frontend for a typing-effect UX. Keep system prompts in config; rotate keys and apply quotas per tenant.

# Flask/FastAPI + vertexai SDK + streaming response
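
A FastAPI sketch that streams model chunks as server-sent events; assumes vertexai.init() ran at startup and omits auth, rate limiting, and caching:

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from vertexai.generative_models import GenerativeModel

app = FastAPI()
model = GenerativeModel("gemini-1.5-flash")

@app.get("/chat")
def chat(q: str):
    def events():
        for chunk in model.generate_content(q, stream=True):
            yield f"data: {chunk.text}\n\n"     # one SSE frame per chunk
    return StreamingResponse(events(), media_type="text/event-stream")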

35) Document AI + RAG

Extract text/structure with Document AI, chunk and index in Vector Search, then build a Gemini-powered Q&A over your PDFs. Keep provenance links and confidence scores; redact PII if needed.

# Pipeline: GCS -> DocAI -> chunks -> embeddings -> index
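
A hedged sketch of the extract-then-chunk step; the processor resource name is illustrative, and fixed-size chunking stands in for semantic chunking:

from google.cloud import documentai

client = documentai.DocumentProcessorServiceClient()
name = "projects/P/locations/us/processors/PROCESSOR_ID"
raw = documentai.RawDocument(content=open("doc.pdf", "rb").read(), mime_type="application/pdf")
result = client.process_document(request=documentai.ProcessRequest(name=name, raw_document=raw))
text = result.document.text
chunks = [text[i:i + 1500] for i in range(0, len(text), 1500)]   # next: embed and index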

36) Images & Vision APIs

Use Vertex Vision models for classification/detection or tune for your labels. For generative images, call the appropriate model endpoints (when available in your region). Store prompts/outputs for audit and safety review.

# Upload labeled images to GCS, start AutoML Vision training

37) Code Assist & Agents

Build internal code assistants with Gemini Code models, function calling, and repository retrieval. Add guardrails (no secrets), explain diffs, and propose patches. For support bots, combine conversational state with RAG and action tools (ticket systems).

# Tool-enabled chat pipeline with user/session context

38) Evaluations at Scale

Run batch evals in Pipelines against curated prompts. Capture metrics (accuracy, toxicity, refusal rates), store in BigQuery, and visualize in Looker Studio. Automate regression checks on every model or prompt change.

# Pipeline step writes eval JSONL to BigQuery for dashboards
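
A small sketch writing eval rows to an existing BigQuery table (table and columns are illustrative):

from google.cloud import bigquery

bq = bigquery.Client()
rows = [{"run_id": "2024-06-01", "prompt_id": "p-17",
         "accuracy": 0.92, "toxicity": 0.01, "refused": False}]
errors = bq.insert_rows_json("proj.ds.genai_evals", rows)   # streaming insert; table must exist
assert not errors, errors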

39) Multi-Region & DR

Choose regions close to data/users; replicate artifacts and indexes. Use separate projects for isolation, per-env service accounts, and org policies. Test failovers and rate-limit fallbacks in client apps.

# Artifact Registry mirrors; dual endpoints with traffic split

40) Q&A — “How to secure GenAI endpoints?”

Answer: Enforce IAM, private networking (PSC), request auth (ID tokens), quotas, and per-tenant limits. Apply safety filters, prompt wrapping, output validation (JSON schemas), and PII redaction. Log prompts/outputs with data retention controls and build abuse monitoring.

Section 5 — Cheats, Pitfalls, Interview Q&A

41) Quickstart: Create Endpoint & Predict (Python)

Initialize, upload, deploy, predict; tear down when done to save cost. Ensure the service account has roles/aiplatform.user and access to the model's GCS bucket; set the region explicitly.

from google.cloud import aiplatform as aip
aip.init(project="P", location="us-central1")
m = aip.Model.upload(
    display_name="clf",
    artifact_uri="gs://bucket/model",
    # A serving container is required; prebuilt image/tag shown is illustrative for scikit-learn
    serving_container_image_uri="us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.1-3:latest",
)
ep = aip.Endpoint.create(display_name="clf-ep")
m.deploy(endpoint=ep, machine_type="n1-standard-2")
print(ep.predict(instances=[{"x":1,"y":2}]).predictions)  # payload shape depends on the serving container/model signature

42) Quickstart: Gemini REST (curl)

Call the generative endpoint via REST with OAuth access token. Prefer server-to-server calls; never expose tokens in the browser. Stream when building chats.

curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  https://.../projects/PROJECT/locations/us-central1/publishers/google/models/gemini-1.5-pro:generateContent \
  -d '{"contents":[{"role":"user","parts":[{"text":"Explain Vertex AI Pipelines"}]}]}'

43) Prompt Template Pattern

Keep prompts in files with placeholders; inject variables server-side. Version prompts and track A/B results. For JSON outputs, validate against a schema and retry with error-aware prompts.

template = """Role: helpful assistant.
Output JSON with keys: steps[], risks[].
Topic: {topic}"""
prompt = template.format(topic="Model Monitoring")
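
A sketch of schema checking with an error-aware retry; the key set mirrors the template above and the retry wording is illustrative:

import json

REQUIRED_KEYS = {"steps", "risks"}

def ask_json(model, prompt, max_retries=2):
    for _ in range(max_retries + 1):
        raw = model.generate_content(prompt).text
        try:
            data = json.loads(raw)
            if REQUIRED_KEYS <= set(data):
                return data
        except json.JSONDecodeError:
            pass
        prompt += f"\n\nYour last reply was not valid JSON with keys {sorted(REQUIRED_KEYS)}. Reply with JSON only."
    raise ValueError("model did not return valid JSON")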

44) RAG Chunking & Metadata

Chunk by semantic boundaries, store title/section/source_url, and use hybrid retrieval (BM25 + vectors). Re-rank candidates before prompting. Add citations in the final answer and cache results.

# Store metadata alongside vectors: {"doc_id": "...", "section": "...", "url": "..."}
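
A plain-Python sketch of paragraph-level chunking that keeps metadata alongside each chunk for attribution:

def chunk_doc(doc_id, title, url, text, max_chars=1200):
    chunks, buf = [], ""
    for para in text.split("\n\n"):               # paragraph boundaries as a cheap semantic split
        if buf and len(buf) + len(para) > max_chars:
            chunks.append(buf.strip())
            buf = ""
        buf += para + "\n\n"
    if buf.strip():
        chunks.append(buf.strip())
    return [{"doc_id": doc_id, "title": title, "url": url, "chunk_id": i, "text": c}
            for i, c in enumerate(chunks)]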

45) Latency Tactics

Choose closer region, use “flash” models when feasible, enable streaming, cache embeddings, reuse HTTP connections, and pre-warm endpoints. For pipelines, use caching and parallelism; avoid tiny batch sizes.

# HTTP keep-alive + connection pooling in your client

46) Cost Tactics

Batch non-urgent requests, cap max tokens, set per-user quotas, and use retrieval caches. Downshift model tiers for drafts and upgrade for finalization. Delete unused endpoints and indexes.

# Track cost by labels: project, team, app, environment

47) Common Pitfalls

Forgetting the region in SDK calls, mixing projects or service-account scopes, leaving endpoints running, skipping safety filters, missing retries/timeouts, and unbounded prompt sizes. Fix with client wrappers, guardrails, and budgets.

# Always set: aiplatform.init(project="...", location="...")

48) Production Checklist

IAM + network controls, logging/metrics/traces, eval gates, canary rollouts, budget alerts, data retention, incident runbooks, and continuous red-teaming.

# Cloud Monitoring alerts: p95 latency, error rate, token spend

49) Reference Patterns

1) GenAI Chat with tools + RAG. 2) Classification endpoint + batch scoring. 3) Image AutoML + online detection. 4) Recommender with embeddings + vector search. 5) Code assistant with repo retrieval.

# Start simple, measure, iterate; codify patterns as reusable microservices

50) Interview Q&A — 20 Practical Questions (Expanded)

1) Vertex AI vs BQML? — BQML for SQL-native modeling; Vertex for custom training, GenAI, vector search, and MLOps.