# AI and ML perspective: Performance optimization

This document in the [Well-Architected Framework: AI and ML perspective](/architecture/framework/perspectives/ai-ml) provides an overview of principles and recommendations to help you optimize the performance of your AI and ML workloads on Google Cloud. The recommendations in this document align with the [performance optimization pillar](/architecture/framework/performance-optimization) of the Google Cloud Well-Architected Framework.
AI and ML systems enable new automation and decision-making capabilities for your organization. The performance of these systems can directly affect your business drivers like revenue, costs, and customer satisfaction. To realize the full potential of your AI and ML systems, you need to optimize their performance based on your business goals and technical requirements. The performance optimization process often involves certain trade-offs. For example, a design choice that provides the required performance might lead to higher costs. The recommendations in this document prioritize performance over other considerations like costs.
To optimize AI and ML performance, you need to make decisions regarding factors like the model architecture, parameters, and training strategy. When you make these decisions, consider the entire lifecycle of the AI and ML systems and their deployment environment. For example, LLMs that are very large can be highly performant on massive training infrastructure, but very large models might not perform well in capacity-constrained environments like mobile devices.
Translate business goals to performance objectives
--------------------------------------------------
To make architectural decisions that optimize performance, start with a clear set of business goals. Design AI and ML systems that provide the technical performance that's required to support your business goals and priorities. Your technical teams must understand the mapping between performance objectives and business goals.
Consider the following recommendations:
- **Translate business objectives into technical requirements**: Translate the business objectives of your AI and ML systems into specific technical performance requirements and assess the effects of not meeting the requirements. For example, for an application that predicts customer churn, the ML model should perform well on standard metrics, like accuracy and recall, *and* the application should meet operational requirements like low latency. A minimal sketch of this kind of requirements check appears after this list.
- **Monitor performance at all stages of the model lifecycle**: During experimentation and training, and after model deployment, monitor your key performance indicators (KPIs) and observe any deviations from business objectives.
- **Automate evaluation to make it reproducible and standardized**: With a standardized and comparable platform and methodology for experiment evaluation, your engineers can increase the pace of performance improvement.
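To make the first recommendation concrete, the following minimal Python sketch checks a churn model's measured quality and latency against thresholds derived from business objectives. The threshold values, metric choices, and the `meets_requirements` helper are illustrative assumptions, not values that this framework prescribes.

```python
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical thresholds derived from business objectives for a
# churn-prediction application; your actual targets will differ.
REQUIREMENTS = {
    "accuracy": 0.85,       # minimum acceptable accuracy
    "recall": 0.75,         # minimum acceptable recall for the churn class
    "p95_latency_ms": 200,  # maximum acceptable serving latency
}

def meets_requirements(y_true, y_pred, p95_latency_ms):
    """Check measured model performance against the technical requirements."""
    measured = {
        "accuracy": accuracy_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "p95_latency_ms": p95_latency_ms,
    }
    failures = [
        f"{name}: measured {measured[name]:.3f}, required >= {REQUIREMENTS[name]}"
        for name in ("accuracy", "recall")
        if measured[name] < REQUIREMENTS[name]
    ]
    if measured["p95_latency_ms"] > REQUIREMENTS["p95_latency_ms"]:
        failures.append(
            f"p95_latency_ms: measured {measured['p95_latency_ms']}, "
            f"required <= {REQUIREMENTS['p95_latency_ms']}"
        )
    return not failures, failures
```

A check like this can run as an automated gate in your evaluation pipeline, which also supports the reproducible, standardized evaluation that the third recommendation calls for.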
[[["容易理解","easyToUnderstand","thumb-up"],["確實解決了我的問題","solvedMyProblem","thumb-up"],["其他","otherUp","thumb-up"]],[["難以理解","hardToUnderstand","thumb-down"],["資訊或程式碼範例有誤","incorrectInformationOrSampleCode","thumb-down"],["缺少我需要的資訊/範例","missingTheInformationSamplesINeed","thumb-down"],["翻譯問題","translationIssue","thumb-down"],["其他","otherDown","thumb-down"]],["上次更新時間:2024-10-11 (世界標準時間)。"],[[["\u003cp\u003eThis document outlines principles and recommendations for optimizing the performance of AI and ML workloads on Google Cloud, aligning with the performance optimization pillar of the Well-Architected Framework.\u003c/p\u003e\n"],["\u003cp\u003eOptimizing AI and ML system performance is crucial for business drivers like revenue and customer satisfaction, requiring decisions on model architecture, parameters, and training strategies throughout the entire lifecycle.\u003c/p\u003e\n"],["\u003cp\u003eTranslating business objectives into specific technical requirements, monitoring performance at all stages, and automating evaluations are essential for effective AI/ML performance optimization.\u003c/p\u003e\n"],["\u003cp\u003eBuilding a dedicated experimentation environment, embedding experimentation into the company's culture, and leveraging AI-specialized components for training and prediction are key to successful AI/ML development.\u003c/p\u003e\n"],["\u003cp\u003eLinking performance metrics to design choices and configurations through a data and model lineage system, alongside using explainability tools, is vital for innovating and improving model performance.\u003c/p\u003e\n"]]],[],null,["# AI and ML perspective: Performance optimization\n\nThis document in the\n[Well-Architected Framework: AI and ML perspective](/architecture/framework/perspectives/ai-ml)\nprovides an overview of principles and recommendations to help you to optimize\nthe performance of your AI and ML workloads on Google Cloud. The\nrecommendations in this document align with the\n[performance optimization pillar](/architecture/framework/performance-optimization)\nof the Google Cloud Well-Architected Framework.\n\nAI and ML systems enable new automation and decision-making capabilities for\nyour organization. The performance of these systems can directly affect your\nbusiness drivers like revenue, costs, and customer satisfaction. To realize the\nfull potential of your AI and ML systems, you need to optimize their performance\nbased on your business goals and technical requirements. The performance\noptimization process often involves certain trade-offs. For example, a design\nchoice that provides the required performance might lead to higher costs. The\nrecommendations in this document prioritize performance over other\nconsiderations like costs.\n\nTo optimize AI and ML performance, you need to make decisions regarding factors\nlike the model architecture, parameters, and training strategy. When you make\nthese decisions, consider the entire lifecycle of the AI and ML systems and\ntheir deployment environment. For example, LLMs that are very large can be\nhighly performant on massive training infrastructure, but very large models\nmight not perform well in capacity-constrained environments like mobile\ndevices.\n\nTranslate business goals to performance objectives\n--------------------------------------------------\n\nTo make architectural decisions that optimize performance, start with a clear\nset of business goals. 
Build and automate training and serving services
------------------------------------------------

Training and serving AI models are core components of your AI services. You need robust platforms and practices that support fast and reliable creation, deployment, and serving of AI models. Invest time and effort to create foundational platforms for your core AI training and serving tasks. These foundational platforms help to reduce time and effort for your teams and improve the quality of outputs in the medium and long term.

Consider the following recommendations:

- **Use AI-specialized components of a training service**: Such components include high-performance compute and MLOps components like feature stores, model registries, metadata stores, and model performance-evaluation services.
- **Use AI-specialized components of a prediction service**: Such components provide high-performance and scalable resources, support feature monitoring, and enable model performance monitoring. To prevent and manage performance degradation, implement reliable deployment and rollback strategies. A sketch of a gradual rollout appears after this list.
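As one example of a deployment strategy that keeps rollback cheap, the following sketch uses the Vertex AI SDK to send a small share of traffic to a new model version while the previous version keeps serving the rest. The project ID and the model and endpoint resource names are placeholders.

```python
from google.cloud import aiplatform

# Placeholders: substitute your own project, region, and resource names.
aiplatform.init(project="your-project-id", location="us-central1")

endpoint = aiplatform.Endpoint("ENDPOINT_RESOURCE_NAME")
new_model = aiplatform.Model("NEW_MODEL_RESOURCE_NAME")

# Canary rollout: route 10% of traffic to the new model version while the
# previously deployed version keeps serving the remaining 90%. Rolling back
# becomes a single traffic change rather than a redeployment.
endpoint = new_model.deploy(
    endpoint=endpoint,
    machine_type="n1-standard-4",  # assumption; size to your workload
    traffic_percentage=10,
)
```

If the new version's monitored performance holds up, you can shift the remaining traffic to it; if it degrades, you shift traffic back to the prior version.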
Match design choices to performance requirements
------------------------------------------------

When you make design choices to improve performance, carefully assess whether the choices support your business requirements or are wasteful and counterproductive. To choose the appropriate infrastructure, models, or configurations, identify performance bottlenecks and assess how they're linked to your performance measures. For example, even on very powerful GPU accelerators, your training tasks can experience performance bottlenecks due to data I/O issues from the storage layer or due to performance limitations of the model itself. A sketch of a simple bottleneck probe appears after the following list.

Consider the following recommendations:

- **Optimize hardware consumption based on performance goals**: To train and serve ML models that meet your performance requirements, you need to optimize infrastructure at the compute, storage, and network layers. You must measure and understand the variables that affect your performance goals. These variables are different for training and inference.
- **Focus on workload-specific requirements**: Focus your performance optimization efforts on the unique requirements of your AI and ML workloads. Rely on managed services for the performance of the underlying infrastructure.
- **Choose appropriate training strategies**: Several pre-trained and foundational models are available, and more such models are released often. Choose a training strategy that can deliver optimal performance for your task. Decide whether you should build your own model, tune a pre-trained model on your data, or use a pre-trained model API.
- **Recognize that performance-optimization strategies can have diminishing returns**: When a particular performance-optimization strategy doesn't provide incremental business value that's measurable, stop pursuing that strategy.
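To tell whether a training task is bottlenecked by data I/O or by compute, a rough wall-clock breakdown of each step is often enough. The following is a minimal, framework-agnostic sketch; `data_iterator` and `train_step_fn` are hypothetical stand-ins for your own input pipeline and training step.

```python
import time

def profile_training_loop(data_iterator, train_step_fn, num_steps=100):
    """Attribute wall-clock time to data input versus compute.

    For frameworks with asynchronous execution (for example, eagerly
    dispatched GPU kernels), add an explicit synchronization call inside
    train_step_fn so that the timing reflects real work.
    """
    data_time = 0.0
    compute_time = 0.0
    for _ in range(num_steps):
        t0 = time.perf_counter()
        batch = next(data_iterator)  # time spent waiting on the input pipeline
        t1 = time.perf_counter()
        train_step_fn(batch)         # time spent in the training step itself
        t2 = time.perf_counter()
        data_time += t1 - t0
        compute_time += t2 - t1
    total = data_time + compute_time
    print(f"Input pipeline: {data_time / total:.0%} of step time")
    print(f"Compute:        {compute_time / total:.0%} of step time")
```

If the input pipeline dominates, faster accelerators won't improve throughput; the storage layer and data loading are the place to invest instead.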
Link performance metrics to design and configuration choices
------------------------------------------------------------

To innovate, troubleshoot, and investigate performance issues, establish a clear link between design choices and performance outcomes. In addition to experimentation, you must reliably record the lineage of your assets, deployments, model outputs, and the configurations and inputs that produced the outputs.

Consider the following recommendations:

- **Build a data and model lineage system**: All of your deployed assets and their performance metrics must be linked back to the data, configurations, code, and the choices that resulted in the deployed systems. In addition, model outputs must be linked to specific model versions and how the outputs were produced.
- **Use explainability tools to improve model performance**: Adopt and standardize tools and benchmarks for model exploration and explainability. These tools help your ML engineers understand model behavior and improve performance or remove biases.

Contributors
------------

Authors:

- [Benjamin Sadik](https://www.linkedin.com/in/benjaminhaimsadik) | AI and ML Specialist Customer Engineer
- [Filipe Gracio, PhD](https://www.linkedin.com/in/filipegracio) | Customer Engineer, AI/ML Specialist

Other contributors:

- [Kumar Dhanagopal](https://www.linkedin.com/in/kumardhanagopal) | Cross-Product Solution Developer
- [Marwan Al Shawi](https://www.linkedin.com/in/marwanalshawi) | Partner Customer Engineer
- [Zach Seils](https://www.linkedin.com/in/zachseils) | Networking Specialist