Run LLM inference on Cloud Run GPUs with vLLM (services)
The following codelab shows how to run a backend service that runs vLLM, an
inference engine for production systems, together with Google's Gemma 2, a
2-billion-parameter instruction-tuned model.
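At a high level, the codelab packages vLLM's OpenAI-compatible server in a container and deploys it to Cloud Run with a GPU attached. The following Dockerfile is a minimal sketch of that container; the base image tag, cache directory, and model ID are illustrative and may differ from the codelab's exact configuration.

```
# Minimal sketch: serve Gemma 2 2B (instruction-tuned) with vLLM's
# OpenAI-compatible API server. Values here are illustrative.
FROM vllm/vllm-openai:latest

# Cache downloaded model weights inside the container filesystem.
ENV HF_HOME=/model-cache

# Listen on 0.0.0.0:8080, the port Cloud Run routes requests to by default.
ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server", \
            "--model", "google/gemma-2-2b-it", \
            "--host", "0.0.0.0", \
            "--port", "8080"]
```

After building and pushing the image, a deployment could look like the sketch below. The service name, region, image path, resource sizes, and the way the Hugging Face token is supplied are assumptions for illustration; the codelab describes the exact values and a secret-based setup.

```
# Deploy the container to Cloud Run with one NVIDIA L4 GPU attached.
gcloud beta run deploy vllm-gemma-2-2b \
  --image us-central1-docker.pkg.dev/PROJECT_ID/REPO/vllm-gemma-2-2b \
  --region us-central1 \
  --cpu 8 \
  --memory 32Gi \
  --gpu 1 \
  --gpu-type nvidia-l4 \
  --no-cpu-throttling \
  --max-instances 1 \
  --set-env-vars HF_TOKEN=YOUR_HF_TOKEN
```

GPU-backed services require CPU always allocated (hence --no-cpu-throttling) and enough memory for the model weights and the vLLM runtime; the 8 CPU / 32 GiB sizing above is one reasonable starting point for a 2B-parameter model, not a requirement.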
[[["Easy to understand","easyToUnderstand","thumb-up"],["Solved my problem","solvedMyProblem","thumb-up"],["Other","otherUp","thumb-up"]],[["Hard to understand","hardToUnderstand","thumb-down"],["Incorrect information or sample code","incorrectInformationOrSampleCode","thumb-down"],["Missing the information/samples I need","missingTheInformationSamplesINeed","thumb-down"],["Other","otherDown","thumb-down"]],["Last updated 2024-10-30 UTC."],[],[]]