Failure modes

A failure mode is an incorrect application state that prompts an alert. The application must recover from a failure mode to run successfully. For example, the system prompts an alert when the AI pre-trained APIs are still not ready for use after the designated enable time limit has passed. If a failure mode occurs and the application cannot recover, contact your Infrastructure Operator for help.

The following failure modes (FMs) might occur and prompt an alert:

Service readiness failures

Service readiness failures occur because of one of the following FMs:

  • FM1 - Unable to schedule workloads: One or more of the AI service workloads cannot be scheduled because of a lack of resources, such as GPU or memory, or because of another error.
  • FM3 - Unable to configure components: One of the required components of an AI service, such as DNS or Ingress, cannot be configured or created because of incorrect permissions or other issues.
  • FM4 - Services not reaching the Enabled status: The pre-trained services do not become ready after the enablement process starts. The page keeps displaying the Enabling status for one or more services and, possibly, for the AI infrastructure, and never changes to the Enabled status.
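
The FM4 condition, a service that stays in the Enabling status past the enable time limit, can be sketched as a simple check. This is only an illustration: the function name, the 30-minute limit, and the way the status and start time are obtained are all assumptions, not part of the product.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical value; the actual enable time limit is defined by the platform.
ENABLE_TIME_LIMIT = timedelta(minutes=30)


def stuck_in_enabling(status, enable_started_at, now, limit=ENABLE_TIME_LIMIT):
    """Return True if a service is still Enabling after the time limit."""
    return status == "Enabling" and (now - enable_started_at) > limit


start = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(stuck_in_enabling("Enabling", start, start + timedelta(minutes=45)))  # True
print(stuck_in_enabling("Enabling", start, start + timedelta(minutes=10)))  # False
print(stuck_in_enabling("Enabled", start, start + timedelta(minutes=45)))   # False
```

A service that has already reached the Enabled status is never reported, no matter how long enablement took.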

User interface failures

User interface failures occur because of one of the following FMs:

  • Frontend and backend communication failure: The page displays an error message that reports issues with backend communication. The corresponding error log entries have codes from AIPL0500 to AIPL0502.
  • Service API endpoints aren't displayed on the page: If an error occurs while fetching an endpoint, the page shows the Unable to fetch the endpoint message instead of the endpoint.
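
When reviewing logs for the frontend and backend communication failure, it can help to filter for the codes AIPL0500 to AIPL0502. The following is a minimal sketch that assumes the code appears as a separate token in each log line; the log line format and the sample messages are invented for illustration and do not reflect the actual log output.

```python
import re

# Matches AIPL0500, AIPL0501, and AIPL0502 only.
COMM_ERROR_CODE = re.compile(r"\bAIPL050[0-2]\b")


def communication_errors(log_lines):
    """Return the log lines that carry a communication error code."""
    return [line for line in log_lines if COMM_ERROR_CODE.search(line)]


# Made-up sample lines; real log entries will look different.
sample = [
    "ERROR AIPL0500 request to backend failed",
    "INFO page loaded",
    "ERROR AIPL0502 backend response could not be read",
    "ERROR AIPL0503 unrelated error",
]

for line in communication_errors(sample):
    print(line)
```

With the sample above, only the AIPL0500 and AIPL0502 lines are printed.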