18. Post-hardware checks

Estimated time to complete: 2 hours

Operable component owner: OELCM

Skills profile: deployment engineer

18.1. Check configuration

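To verify the quality, security, and efficacy of the Google Distributed Cloud (GDC) air-gapped hardware and software assets delivered by HPE, and to confirm that they are ready for production, use the validation CLI from the Distributed Cloud release.

The validation suite tests the health, installation, and configuration of the devices. It covers servers, network switches, file/block storage, object storage, firewalls, HSMs, and more.

To validate the hardware, complete the following steps:

  1. On the bootstrapper machine, run the validation CLI command with root privileges using sudo: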

    sudo RELEASE_DIR/gdcloud system check-config --config CELL_CONFIG_PATH --artifacts-directory ARTIFACTS_DIR --scenario ConfigCheck
    

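    This command writes all logs to ARTIFACTS_DIR.

  2. If any errors are found, fix every issue according to its error message, then rerun the validation.

  3. If all reports show healthy, continue to the next step.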
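The run in step 1 can be wrapped in a small helper so that the same flags are used on every rerun. This is a minimal sketch, not part of the release tooling: the function names are hypothetical, the paths in the usage note are placeholders, and only the flags shown in step 1 are used. The meaning of the process exit status is not documented here, so the reports in ARTIFACTS_DIR remain the source of truth.

```python
import subprocess
from pathlib import Path


def build_check_config_command(release_dir: str, cell_config: str, artifacts_dir: str) -> list[str]:
    """Return the argv for the ConfigCheck run shown in step 1."""
    return [
        "sudo",
        str(Path(release_dir) / "gdcloud"),
        "system", "check-config",
        "--config", cell_config,
        "--artifacts-directory", artifacts_dir,
        "--scenario", "ConfigCheck",
    ]


def run_check_config(release_dir: str, cell_config: str, artifacts_dir: str) -> int:
    """Run the validation and return the raw exit status.

    Assumption: the exit status semantics are unknown, so callers should
    still read the reports written to artifacts_dir.
    """
    command = build_check_config_command(release_dir, cell_config, artifacts_dir)
    return subprocess.run(command).returncode
```

For example, `run_check_config("/path/to/release", "/path/to/cell.yaml", "/path/to/artifacts")` runs the same command as step 1 with those placeholder paths substituted.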

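18.2. Potential issues

This section lists issues that you might encounter during post-installation validation of a Distributed Cloud instance.

18.2.1. Potential issues in all Google Distributed Cloud versions

18.2.1.1. Network check incorrectly flags storage devices connected to a patch panel

Problem

The check fails with the summary text: Storage network connection mismatched

The detail text looks like this: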

Got: xx-ab-stge01-01:e0g<>xx-ab-torsw02 (:::::):Ethernet1/1/1, want: expected: xx-ab-stge01-01:e0g<>xx-ab-ppl01:r04Ap01BO-ft

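The main symptom is the second part of the check (the want side), which contains a patch panel label such as r04Ap01BO-ft.

Workaround

In the assets/inv/inv-core.yaml file, manually check the cell CR:

Using the failure example: Got: xx-ab-stge01-01:e0g<>xx-ab-torsw02 (:::::):Ethernet1/1/1, want: expected: xx-ab-stge01-01:e0g<>xx-ab-ppl01:r04Ap01BO-ft

  • Confirm that an entry exists that names both the storage device and the patch panel.

For example, xx-ab-stge01-01:e0g<>xx-ab-ppl01:r04Ap01BO-ft appears as: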

        - cableType: MMF
          color: Aqua
          endA: xx-ab-stge01-01:e0g
          endATransceiverMPN: X65404-N-C
          endB: xx-ab-ppl01:r04Ap01BO-ft
          length: 2m
          mpn: 'OM4LCDX #40220 (2m)'
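  • Confirm that the corresponding patch panel connects to the named torswitch.

To find the other side of the patch panel, take r04Ap01BO-ft and change -ft to -bk on the first part (the part with the r and the number). r04Ap01BO-ftr04Ap02BO-ft corresponds to r04Ap01BO-bk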

        - cableType: MMF
          color: Magenta
          endA: xx-ab-torsw02:Eth1/1
          endATransceiverMPN: QSFP-100G-SL4
          endB: xx-ab-ppl01:r04Ap01BO-bk
          length: 1.5m
          mpn: '12FMTPOM4 #73704 (1.5m)'
          notes: 25Gb breakout

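The other end of the cable entry (endA) should match the first part of the check (the Got side). In this example, endA is xx-ab-torsw02:Eth1/1.

Ethernet1/1/1 means that physical port 1 on torsw02 is connected through a breakout, on the first breakout lane.

If the mapping is correct, you can ignore this check.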
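The manual lookup above can be sketched in code. The following is an illustrative sketch only: it assumes the cable entries from assets/inv/inv-core.yaml have already been loaded into a list of dictionaries with endA and endB keys, and it assumes that a reported port such as Ethernet1/1/1 corresponds to the inventory port Eth1/1 plus a breakout lane. All function names are hypothetical.

```python
import re

# Matches the detail text: "Got: A<>DEV (MAC):PORT, want: expected: A<>PANEL"
DETAIL_RE = re.compile(
    r"Got: (?P<got_a>\S+)<>(?P<got_dev>\S+) \((?P<mac>[^)]*)\):(?P<got_port>[^,\s]+),"
    r"\s+want: expected: (?P<want_a>\S+)<>(?P<want_b>\S+)"
)


def back_label(front: str) -> str:
    """Swap the -ft (front) suffix of a patch panel label for -bk (back)."""
    if not front.endswith("-ft"):
        raise ValueError(f"not a front patch panel label: {front}")
    return front[: -len("-ft")] + "-bk"


def short_port(port: str) -> str:
    """Assumption: 'Ethernet1/1/1' is inventory port 'Eth1/1' plus a breakout lane."""
    m = re.fullmatch(r"Ethernet(\d+)/(\d+)/\d+", port)
    if m is None:
        raise ValueError(f"unexpected port format: {port}")
    return f"Eth{m.group(1)}/{m.group(2)}"


def mapping_is_consistent(detail: str, cables: list[dict]) -> bool:
    """True if both patch panel hops implied by the failure exist in the inventory."""
    m = DETAIL_RE.search(detail)
    if m is None:
        raise ValueError("unrecognized detail text")
    ends = {(c["endA"], c["endB"]) for c in cables}
    # Hop 1: storage port to the front of the patch panel.
    front_ok = (m["want_a"], m["want_b"]) in ends
    # Hop 2: switch port to the back of the same patch panel position.
    switch_end = f"{m['got_dev']}:{short_port(m['got_port'])}"
    back_ok = (switch_end, back_label(m["want_b"])) in ends
    return front_ok and back_ok
```

With the failure example and the two inventory entries shown in this section, mapping_is_consistent returns True, which corresponds to the case where the check can be ignored.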

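18.2.1.2. Object storage site reconcile error (malformed DNS suffix)

Problem

The ObjectStorageSite custom resource is set to Ready: false, and its logs report Reconcile error, retrying: failed to parse location, found malformed DNSSuffix

Workaround

Ignore these errors. They disappear once the installation completes the root admin cluster bootstrap step.

18.2.1.3. Incorrect bare metal machine settings for the root admin cluster

Example failure in the validation output: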

- passed: false
  description: |-
    BMM setting validation on server xx-yy-bm01 failed with error:
    server has unexpected settings:
    /redfish/v1/Systems/1/SecureBoot.SecureBootEnable is true, want false
  target: xx-yy-bm01
  targettype: ServerSettings
  vendorerrorcode: SERVER_TEST_FAIL(0x04)
  gpcerrorcode: FailedInBMMSetting
  mitigation: Refer to the artifact to see which server flags. Check the connection
    to the server iLO port. Check the account of iLO. Check if the iLO and server
    are fully powered up. Check the concerned settings of server ah-ab-bm01.
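To see at a glance which settings differ, the description text of such a failure can be parsed. This is a minimal sketch that assumes each offending setting is reported on its own line in the form "PATH is ACTUAL, want EXPECTED", as in the example above; the function name is hypothetical.

```python
import re

# One line per setting: "<redfish path> is <actual>, want <expected>"
SETTING_RE = re.compile(
    r"^\s*(?P<path>/redfish/\S+) is (?P<got>\S+), want (?P<want>\S+)\s*$",
    re.MULTILINE,
)


def unexpected_settings(description: str) -> list[tuple[str, str, str]]:
    """Return (path, actual, expected) for every setting line in the description."""
    return [(m["path"], m["got"], m["want"]) for m in SETTING_RE.finditer(description)]
```

Applied to the description above, this yields a single entry for /redfish/v1/Systems/1/SecureBoot.SecureBootEnable with the actual value true and the expected value false.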

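18.2.1.4. Patch panel mismatch

Problem

The hardware check targets the device at the far end of the connection rather than the directly connected device (xx-xx-ppl), so the result does not match the patch panel listed in the cell configuration.

Example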

- description: This check validates the storage network connection against the cell
    configuration.
  target: xx-yy-stge01-01:e0e<>xx-yy-torsw01 (aa:aa:aa:aa:aa:aa):Ethernet1/1/1
  targettype: ""
  checkresult:
    passed: false
    summary: Storage network connection mismatched.
    detail: 'Got: xx-yy-stge01-01:e0e<>xx-yy-torsw01 (aa:aa:aa:aa:aa:aa):Ethernet1/1/1,
      want: expected: xx-yy-stge01-01:e0e<>xx-yy-ppl01:r03Ap01BO-ft'
    vendorerrorcode: ""
    errorcode: VAL-E3026
    mitigation: If this check fails, it can indicate that the Storage system is not
      configurated according to the configuration file. Adjust the cabling so it matches
      with the cell configuration.

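Workaround

Ignore the error.

18.2.1.5. Ping test fails

Problem

This is normal behavior for CDP: ARP flooding must occur to populate the CAM table on the switch before the device can be reached. The first 1 to 5 packets are very likely to be lost.

Example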

- description: This check validates the link quality from the management switches
    to other switches and baremetal node by measuring the packet delivery ratio of
    100 ping requests.
  target: xx-yy-mgmtsw01
  targettype: ManagementSwitch
  checkresult:
    passed: false
    summary: Link quality from ManagementSwitch to other devices is degraded.
    detail: |-
      Check the cable connections of management switch xx-yy-mgmtsw01.
      Error:
      ping test failed on link xx-yy-mgmtsw01:Eth1/52<>xx-yz-mgmtaggsw01:Eth1/1 with 1 packets dropped in 100 packets send
      ping test failed on link xx-yy-mgmtsw01:Eth1/32<>xx-yy-aggsw01:mgmt0 with 1 packets dropped in 100 packets send
      ping test failed on link xx-yy-mgmtsw01:Eth1/36<>xx-yy-mgmtaggsw01:mgmt0 with 1 packets dropped in 100 packets send
      ping test failed on link xx-yy-mgmtsw01:Eth1/41<>xx-yy-torsw02:mgmt0 with 1 packets dropped in 100 packets send
      ping test failed on link xx-yy-mgmtsw01:Eth1/42<>xx-yy-torsw01:mgmt0 with 1 packets dropped in 100 packets send
      ping test failed on link xx-yy-mgmtsw01:Eth1/51<>xx-yy-mgmtaggsw01:Eth1/1 with 1 packets dropped in 100 packets send
      ping test failed on link xx-yy-mgmtsw01:Eth1/45<>xx-yy-base02:ilo with 1 packets dropped in 100 packets send
      ping test failed on link xx-yy-mgmtsw01:Eth1/46<>xx-yy-base03:ilo with 1 packets dropped in 100 packets send
      ping test failed on link xx-yy-mgmtsw01:Eth1/24<>xx-yy-base03:LOM1 with 1 packets dropped in 100 packets send.
    vendorerrorcode: SWITCH_TEST_FAIL(0x01)
    errorcode: VAL-E1003
    mitigation: If this check failed, it usually means the network cables from the
      management switch need to be inspected or replaced. Check the artifacts directory
      or stdout to see which cable flagged.

Workaround

Ignore the error.
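When many links are reported, it can help to separate the expected warm-up loss from anything larger. The sketch below parses the detail lines in the format shown above and lists only links that dropped more than 5 packets. Treating a loss above 5 as worth inspecting is an assumption derived from the "first 1 to 5 packets" statement, not a documented threshold; the function name is hypothetical.

```python
import re

PING_RE = re.compile(
    r"ping test failed on link (?P<link>\S+) with (?P<dropped>\d+) packets dropped "
    r"in (?P<sent>\d+) packets send"
)

# Assumption: losses of up to 5 packets are attributed to ARP/CAM warm-up.
WARMUP_LOSS_MAX = 5


def links_needing_inspection(detail: str, warmup_max: int = WARMUP_LOSS_MAX) -> list[str]:
    """Return the links whose packet loss exceeds the warm-up allowance."""
    return [
        m["link"]
        for m in PING_RE.finditer(detail)
        if int(m["dropped"]) > warmup_max
    ]
```

For the example output above, where every link dropped 1 of 100 packets, the function returns an empty list.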

18.2.1.6. ONTAP storage cluster name check

Problem

The automation looks for the ONTAP device hostname, but the ONTAP device appears on the switch under its serial number.

Example

- description: This check validates the storage cluster name and management interface
    are consistent between netapp ontap client and the cell configuration.
  target: yy-stge-clus-01
  targettype: StorageCluster
  checkresult:
    passed: false
    summary: StorageCluster management interface cannot be found.
    detail: StorageCluster management interface x.x.x.x in the cell configuration
      cannot be found in the netapp ontap client.
    vendorerrorcode: STORAGE_TEST_FAIL(0x03)
    errorcode: VAL-E3007
    mitigation: Review if management IPfor StorageCluster yy-stge-clus-01 in the cell
      configuration is correct.

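Workaround

Ignore the error.

18.2.1.7. Bootstrapper LLDP discovery fails

Problem

show lldp neighbors on the TOR switch cannot find the bootstrapper. This appears to happen because the OS on the bootstrapper (Ubuntu) does not respond to LLDP requests.

Example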

- description: This check validates the connection between TorSwitch and Server. The
    connection is retriveved via "show lldp neighbors" and cross check with the MAC
    address for NIC port from Server defined in the cell configuration.
  target: xx-yy-torsw02
  targettype: TORSwitch
  checkresult:
    passed: false
    summary: Connection between TorSwitch and Server does not match with the cell
      configuration.
    detail: |-
      Check the cable connection between TorSwitch and Server.
      Error:
      the BM server port xx-yy-bm15:s1p2 could not be found in the rack. Check if the server xx-yy-bm15 is powered up. If the server is powered up, check the cell.yaml file to see if the connection to switch port xx-yy-torsw02:Eth1/10/2 comply with the rack mount
    vendorerrorcode: SWITCH_TEST_FAIL(0x01)
    errorcode: VAL-E1001
    mitigation: If this check failed, it usually means the connection from TorSwitch
      to Server does not match the cell configuration. Or the Server has the wrong
      MAC address for NIC port in the cell configuration. Check the artifacts directory
      or stdout to see which connection flagged.

Workaround

Use show mac address-table on the TOR switch to confirm that the connection to the bootstrapper is in place.
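When searching the show mac address-table output for the bootstrapper, the MAC address notation may differ from the one in the cell configuration. The sketch below assumes that the switch prints addresses in the dotted form common on Cisco-style CLIs (for example aabb.ccdd.eeff) and that the cell configuration uses a colon-separated form; both are assumptions about this environment, and the function name is hypothetical.

```python
import re


def to_dotted_mac(mac: str) -> str:
    """Convert a MAC such as 'AA:BB:CC:DD:EE:FF' to the dotted form 'aabb.ccdd.eeff'."""
    digits = re.sub(r"[^0-9A-Fa-f]", "", mac).lower()
    if len(digits) != 12:
        raise ValueError(f"not a 48-bit MAC address: {mac}")
    # Group the 12 hex digits into three blocks of four.
    return ".".join(digits[i:i + 4] for i in range(0, 12, 4))
```

The converted value can then be used as the search string when reading the switch output.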