18. Post-hardware checks

Estimated time to complete: 2 hours

Operable component owner: OELCM

Skills profile: Deployment engineer

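18.1. Config check

To confirm the quality, security, and validity of the Google Distributed Cloud (GDC) air-gapped hardware and software assets delivered by HPE, and to make sure they are ready for production, use the validation CLI from the Distributed Cloud release.

The validation suite tests the health, installation, and configuration of the devices. It includes tests for servers, network switches, file and block storage, object storage, firewalls, and HSMs, to name a few.

To validate the hardware, complete the following steps:

  1. On the bootstrapper machine, run the validation CLI command with root access by using sudo: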

    sudo RELEASE_DIR/gdcloud system check-config --config CELL_CONFIG_PATH --artifacts-directory ARTIFACTS_DIR --scenario ConfigCheck
    

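    This command writes all logs to ARTIFACTS_DIR.

  2. If the check reports any errors, resolve each issue as described in its error message, then rerun the validation.

  3. If every reported health status is good, continue to the next step.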
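
If you script the validation run, the command in step 1 can be assembled as an argument list. The following is a minimal sketch, not part of the release tooling: the function name is hypothetical, and the three path arguments stand in for the RELEASE_DIR, CELL_CONFIG_PATH, and ARTIFACTS_DIR placeholders. It uses only the subcommand and flags shown in step 1.

```python
import shlex


def build_check_config_command(release_dir: str, cell_config_path: str, artifacts_dir: str) -> list:
    """Return the argv list for the ConfigCheck run shown in step 1.

    The function name and parameters are illustrative; the subcommand and
    flags are copied from the documented command.
    """
    gdcloud = release_dir.rstrip("/") + "/gdcloud"
    return [
        "sudo", gdcloud, "system", "check-config",
        "--config", cell_config_path,
        "--artifacts-directory", artifacts_dir,
        "--scenario", "ConfigCheck",
    ]


if __name__ == "__main__":
    # Placeholder paths; substitute the values for your environment.
    argv = build_check_config_command("RELEASE_DIR", "CELL_CONFIG_PATH", "ARTIFACTS_DIR")
    print(shlex.join(argv))
```

Building a list rather than a single string avoids shell quoting problems if any path contains spaces.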

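18.2. Potential issues

This section lists potential issues that you might encounter when you run post-installation validation on a Distributed Cloud instance.

18.2.1. Potential issues in all Google Distributed Cloud versions

18.2.1.1. Network check incorrectly flags storage devices connected to a patch panel

Issue

The check fails with the summary text: Storage network connection mismatched

The detail text looks like the following: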

Got: xx-ab-stge01-01:e0g<>xx-ab-torsw02 (:::::):Ethernet1/1/1, want: expected: xx-ab-stge01-01:e0g<>xx-ab-ppl01:r04Ap01BO-ft

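The key symptom is that the second part of the check detail contains a patch panel label, such as r04Ap01BO-ft.

Workaround

Perform a manual check in the cell CR found in the assets/inv/inv-core.yaml file:

Using the example failure: Got: xx-ab-stge01-01:e0g<>xx-ab-torsw02 (:::::):Ethernet1/1/1, want: expected: xx-ab-stge01-01:e0g<>xx-ab-ppl01:r04Ap01BO-ft

  • Confirm that an entry exists that names both the storage device and the patch panel.

For example, xx-ab-stge01-01:e0g<>xx-ab-ppl01:r04Ap01BO-ft becomes: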

        - cableType: MMF
          color: Aqua
          endA: xx-ab-stge01-01:e0g
          endATransceiverMPN: X65404-N-C
          endB: xx-ab-ppl01:r04Ap01BO-ft
          length: 2m
          mpn: 'OM4LCDX #40220 (2m)'
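  • Confirm that the mapped patch panel port links to the specified TOR switch.

To find the other side of the patch panel, take the label r04Ap01BO-ft, keep the first part (the r, the number, and the rest of the port identifier), and change -ft to -bk. For example, r04Ap01BO-ft maps to r04Ap01BO-bk: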

        - cableType: MMF
          color: Magenta
          endA: xx-ab-torsw02:Eth1/1
          endATransceiverMPN: QSFP-100G-SL4
          endB: xx-ab-ppl01:r04Ap01BO-bk
          length: 1.5m
          mpn: '12FMTPOM4 #73704 (1.5m)'
          notes: 25Gb breakout

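The other end of this cable entry should match the first part of the check detail. In this example, Ethernet1/1/1 means physical port 1 on torsw02, connected through a breakout cable, on the first breakout lane.

If the mapping looks correct, you can ignore this check.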
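
The manual check above can be sketched in code. This is an illustration only: the function names are hypothetical, and the cable entries are hard-coded as dictionaries with the endA and endB fields from the examples, because the surrounding structure of inv-core.yaml is not shown here. Loading the real file would need a YAML parser.

```python
from typing import Optional


def back_label(front_label: str) -> str:
    """Map a patch panel front label (-ft) to its back label (-bk)."""
    if not front_label.endswith("-ft"):
        raise ValueError("not a front label: " + front_label)
    return front_label[: -len("-ft")] + "-bk"


def resolve_far_end(cables: list, storage_end: str) -> Optional[str]:
    """Follow storage port -> panel front -> panel back -> far-end device.

    Returns the endA of the cable attached to the back of the patch panel
    port, or None if either cable entry is missing.
    """
    # Step 1: find the entry that names the storage port.
    first = next((c for c in cables if c["endA"] == storage_end), None)
    if first is None:
        return None
    panel, _, front = first["endB"].partition(":")
    if not front.endswith("-ft"):
        # No patch panel in between; endB is already the far end.
        return first["endB"]
    # Step 2: find the entry attached to the back of the same panel port.
    back = panel + ":" + back_label(front)
    second = next((c for c in cables if c["endB"] == back), None)
    return second["endA"] if second else None


# Entries taken from the two cable examples above (other fields omitted).
CABLES = [
    {"endA": "xx-ab-stge01-01:e0g", "endB": "xx-ab-ppl01:r04Ap01BO-ft"},
    {"endA": "xx-ab-torsw02:Eth1/1", "endB": "xx-ab-ppl01:r04Ap01BO-bk"},
]

print(resolve_far_end(CABLES, "xx-ab-stge01-01:e0g"))  # xx-ab-torsw02:Eth1/1
```

The resolved far end (xx-ab-torsw02:Eth1/1) still has to be compared by eye with the Got side of the check (xx-ab-torsw02, Ethernet1/1/1), because the inventory names the physical port while the switch reports the breakout lane.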

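18.2.1.2. Reconcile error on the object storage site (malformed DNS suffix)

Issue

The ObjectStorageSite custom resource is set to Ready: false, and its logs report Reconcile error, retrying: failed to parse location, found malformed DNSSuffix

Workaround

Ignore the error. It disappears after the "root admin cluster bootstrap" step of the installation completes.

18.2.1.3. Incorrect bare metal settings for the root admin cluster

Example failure in the validation output: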

- passed: false
  description: |-
    BMM setting validation on server xx-yy-bm01 failed with error:
    server has unexpected settings:
    /redfish/v1/Systems/1/SecureBoot.SecureBootEnable is true, want false
  target: xx-yy-bm01
  targettype: ServerSettings
  vendorerrorcode: SERVER_TEST_FAIL(0x04)
  gpcerrorcode: FailedInBMMSetting
  mitigation: Refer to the artifact to see which server flags. Check the connection
    to the server iLO port. Check the account of iLO. Check if the iLO and server
    are fully powered up. Check the concerned settings of server ah-ab-bm01.
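
To see at a glance which server settings a failure like this refers to, the description text can be scanned for lines of the form "<path> is <actual>, want <expected>". The following is a minimal sketch that assumes the message format shown in the example; the function name is hypothetical and the format might differ between releases.

```python
import re

# Matches lines such as:
#   /redfish/v1/Systems/1/SecureBoot.SecureBootEnable is true, want false
SETTING_RE = re.compile(
    r"^\s*(?P<path>/redfish/\S+) is (?P<got>[^,\s]+), want (?P<want>\S+)\s*$",
    re.MULTILINE,
)


def parse_setting_mismatches(description: str) -> list:
    """Return (redfish_path, actual, expected) tuples found in the text."""
    return [(m["path"], m["got"], m["want"]) for m in SETTING_RE.finditer(description)]


DESCRIPTION = """BMM setting validation on server xx-yy-bm01 failed with error:
server has unexpected settings:
/redfish/v1/Systems/1/SecureBoot.SecureBootEnable is true, want false"""

for path, got, want in parse_setting_mismatches(DESCRIPTION):
    print(path, "actual:", got, "expected:", want)
```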

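18.2.1.4. Patch panel mismatch

Issue

The hardware check should target the device at the far end of the connection rather than the directly connected device (xx-xx-ppl).

Example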

- description: This check validates the storage network connection against the cell
    configuration.
  target: xx-yy-stge01-01:e0e<>xx-yy-torsw01 (aa:aa:aa:aa:aa:aa):Ethernet1/1/1
  targettype: ""
  checkresult:
    passed: false
    summary: Storage network connection mismatched.
    detail: 'Got: xx-yy-stge01-01:e0e<>xx-yy-torsw01 (aa:aa:aa:aa:aa:aa):Ethernet1/1/1,
      want: expected: xx-yy-stge01-01:e0e<>xx-yy-ppl01:r03Ap01BO-ft'
    vendorerrorcode: ""
    errorcode: VAL-E3026
    mitigation: If this check fails, it can indicate that the Storage system is not
      configurated according to the configuration file. Adjust the cabling so it matches
      with the cell configuration.

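Workaround

Ignore these errors.

18.2.1.5. Ping test fails

Issue

This is expected CDP behavior: ARP flooding is required to populate the CAM table on the switch before the devices become reachable, so the first 1 to 5 packets are likely to be dropped.

Example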

- description: This check validates the link quality from the management switches
    to other switches and baremetal node by measuring the packet delivery ratio of
    100 ping requests.
  target: xx-yy-mgmtsw01
  targettype: ManagementSwitch
  checkresult:
    passed: false
    summary: Link quality from ManagementSwitch to other devices is degraded.
    detail: |-
      Check the cable connections of management switch xx-yy-mgmtsw01.
      Error:
      ping test failed on link xx-yy-mgmtsw01:Eth1/52<>xx-yz-mgmtaggsw01:Eth1/1 with 1 packets dropped in 100 packets send
      ping test failed on link xx-yy-mgmtsw01:Eth1/32<>xx-yy-aggsw01:mgmt0 with 1 packets dropped in 100 packets send
      ping test failed on link xx-yy-mgmtsw01:Eth1/36<>xx-yy-mgmtaggsw01:mgmt0 with 1 packets dropped in 100 packets send
      ping test failed on link xx-yy-mgmtsw01:Eth1/41<>xx-yy-torsw02:mgmt0 with 1 packets dropped in 100 packets send
      ping test failed on link xx-yy-mgmtsw01:Eth1/42<>xx-yy-torsw01:mgmt0 with 1 packets dropped in 100 packets send
      ping test failed on link xx-yy-mgmtsw01:Eth1/51<>xx-yy-mgmtaggsw01:Eth1/1 with 1 packets dropped in 100 packets send
      ping test failed on link xx-yy-mgmtsw01:Eth1/45<>xx-yy-base02:ilo with 1 packets dropped in 100 packets send
      ping test failed on link xx-yy-mgmtsw01:Eth1/46<>xx-yy-base03:ilo with 1 packets dropped in 100 packets send
      ping test failed on link xx-yy-mgmtsw01:Eth1/24<>xx-yy-base03:LOM1 with 1 packets dropped in 100 packets send.
    vendorerrorcode: SWITCH_TEST_FAIL(0x01)
    errorcode: VAL-E1003
    mitigation: If this check failed, it usually means the network cables from the
      management switch need to be inspected or replaced. Check the artifacts directory
      or stdout to see which cable flagged.

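Workaround

Ignore these errors.

18.2.1.6. ONTAP storage cluster name check

Issue

The automation looks for the ONTAP device hostname, but ONTAP devices appear on the switch by their serial number.

Example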
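
Before ignoring the failure, it can help to confirm that every reported link lost only a handful of packets. The sketch below parses the detail text in the format shown in the example and separates links whose loss fits the 1 to 5 packet range described above from links that lost more. The function name and the threshold parameter are assumptions for illustration, not part of the validation CLI.

```python
import re

# Matches the per-link lines in the example detail text.
PING_RE = re.compile(
    r"ping test failed on link (?P<link>\S+) with (?P<dropped>\d+) "
    r"packets dropped in (?P<sent>\d+) packets send"
)


def classify_ping_failures(detail: str, max_warmup_drops: int = 5) -> dict:
    """Split failed links into likely warm-up loss and links to investigate."""
    result = {"likely_warmup": [], "investigate": []}
    for m in PING_RE.finditer(detail):
        dropped = int(m["dropped"])
        bucket = "likely_warmup" if dropped <= max_warmup_drops else "investigate"
        result[bucket].append(m["link"])
    return result


DETAIL = """Error:
ping test failed on link xx-yy-mgmtsw01:Eth1/41<>xx-yy-torsw02:mgmt0 with 1 packets dropped in 100 packets send
ping test failed on link xx-yy-mgmtsw01:Eth1/45<>xx-yy-base02:ilo with 40 packets dropped in 100 packets send"""

print(classify_ping_failures(DETAIL))
```

In this made-up input, the second link lost 40 of 100 packets, which does not fit the warm-up explanation and would warrant a cable check as the mitigation text suggests.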

- description: This check validates the storage cluster name and management interface
    are consistent between netapp ontap client and the cell configuration.
  target: yy-stge-clus-01
  targettype: StorageCluster
  checkresult:
    passed: false
    summary: StorageCluster management interface cannot be found.
    detail: StorageCluster management interface x.x.x.x in the cell configuration
      cannot be found in the netapp ontap client.
    vendorerrorcode: STORAGE_TEST_FAIL(0x03)
    errorcode: VAL-E3007
    mitigation: Review if management IPfor StorageCluster yy-stge-clus-01 in the cell
      configuration is correct.

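Workaround

Ignore these errors.

18.2.1.7. Bootstrapper LLDP discovery fails

Issue

show lldp neighbors cannot find the bootstrapper from the TOR switch. This appears to happen because the operating system on the bootstrapper (Ubuntu) does not respond to LLDP requests.

Example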

- description: This check validates the connection between TorSwitch and Server. The
    connection is retriveved via "show lldp neighbors" and cross check with the MAC
    address for NIC port from Server defined in the cell configuration.
  target: xx-yy-torsw02
  targettype: TORSwitch
  checkresult:
    passed: false
    summary: Connection between TorSwitch and Server does not match with the cell
      configuration.
    detail: |-
      Check the cable connection between TorSwitch and Server.
      Error:
      the BM server port xx-yy-bm15:s1p2 could not be found in the rack. Check if the server xx-yy-bm15 is powered up. If the server is powered up, check the cell.yaml file to see if the connection to switch port xx-yy-torsw02:Eth1/10/2 comply with the rack mount
    vendorerrorcode: SWITCH_TEST_FAIL(0x01)
    errorcode: VAL-E1001
    mitigation: If this check failed, it usually means the connection from TorSwitch
      to Server does not match the cell configuration. Or the Server has the wrong
      MAC address for NIC port in the cell configuration. Check the artifacts directory
      or stdout to see which connection flagged.

Workaround

Use show mac address-table on the TOR switch to confirm that the connection from the TOR switch to the bootstrapper is established.