Estimated time to complete: 2 hours
Operable component owner: OELCM
Skill profile: deployment engineer
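18.1. Config check
To confirm the quality, security, and validity of the Google Distributed Cloud (GDC) air-gapped hardware and software assets delivered by HPE, and to make sure they are ready for production, use the validation CLI from the Distributed Cloud release.
The validation suite tests the health, installation, and configuration of the equipment. It includes tests that validate servers, network switches, file and block storage, object storage, firewalls, and HSMs, to name a few.
To validate the hardware, complete the following steps:
Run the validation CLI command with root access on the bootstrapper machine using sudo:
sudo RELEASE_DIR/gdcloud system check-config --config CELL_CONFIG_PATH --artifacts-directory ARTIFACTS_DIR --scenario ConfigCheck
This command writes all logs to ARTIFACTS_DIR.
If any errors are found, resolve every issue according to its error message, then rerun the validation.
If all reported health statuses are good, continue to the next step.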
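The validation run is usually repeated after each fix, so it can help to assemble the command line in one place. The following is a minimal sketch, not part of the release tooling: the helper name build_check_config_command and the example paths are hypothetical, and only the flags that appear in the command above are used.

```python
import shlex


def build_check_config_command(release_dir, cell_config_path, artifacts_dir):
    """Return the argv list for a ConfigCheck validation run.

    The helper name and its parameters are illustrative; the flags are
    the ones shown in the documented command.
    """
    return [
        "sudo",
        f"{release_dir.rstrip('/')}/gdcloud",
        "system",
        "check-config",
        "--config",
        cell_config_path,
        "--artifacts-directory",
        artifacts_dir,
        "--scenario",
        "ConfigCheck",
    ]


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    argv = build_check_config_command(
        "/root/release", "/root/cell.yaml", "/root/artifacts"
    )
    # Print a shell-quoted form that can be reviewed before running it.
    print(shlex.join(argv))
```

The returned list can be passed to subprocess.run on the bootstrapper. Building an argv list rather than a shell string avoids quoting problems when a path contains spaces.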
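18.2. Potential issues
This section describes issues that you might encounter when performing post-installation validation of a Distributed Cloud instance.
18.2.1. Potential issues in all Google Distributed Cloud versions
18.2.1.1. Network check incorrectly flags storage devices connected to a patch panel
Issue:
The check fails with the summary text: Storage network connection mismatched
The detail text looks like the following: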
Got: xx-ab-stge01-01:e0g<>xx-ab-torsw02 (:::::):Ethernet1/1/1,
want: expected: xx-ab-stge01-01:e0g<>xx-ab-ppl01:r04Ap01BO-ft
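The key symptom is that the second part of the check (the want value) contains a patch panel label, such as r04Ap01BO-ft.
Workaround:
Perform a manual check in the cell CR found in the assets/inv/inv-core.yaml file:
Using the example failure: Got: xx-ab-stge01-01:e0g<>xx-ab-torsw02 (:::::):Ethernet1/1/1,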
want: expected: xx-ab-stge01-01:e0g<>xx-ab-ppl01:r04Ap01BO-ft
- Confirm that an entry exists that names both the storage device and the patch panel.
For example, xx-ab-stge01-01:e0g<>xx-ab-ppl01:r04Ap01BO-ft corresponds to:
- cableType: MMF
color: Aqua
endA: xx-ab-stge01-01:e0g
endATransceiverMPN: X65404-N-C
endB: xx-ab-ppl01:r04Ap01BO-ft
length: 2m
mpn: 'OM4LCDX #40220 (2m)'
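- Confirm that the mapped patch panel links to the specified TOR switch.
To find the other side of the patch panel, take r04Ap01BO-ft, keep the first part that contains r and the number, and change -ft to -bk. Both r04Ap01BO-ft and r04Ap02BO-ft map to r04Ap01BO-bk: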
- cableType: MMF
color: Magenta
endA: xx-ab-torsw02:Eth1/1
endATransceiverMPN: QSFP-100G-SL4
endB: xx-ab-ppl01:r04Ap01BO-bk
length: 1.5m
mpn: '12FMTPOM4 #73704 (1.5m)'
notes: 25Gb breakout
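The other end of the cable entry should match the first part of the check. In this example:
Ethernet1/1/1 means physical port 1 on torsw02, connected through a breakout to the first breakout.
If the mapping looks correct, you can ignore this check.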
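The manual check above can be sketched in code. This is an illustration only: it assumes the cable entries from assets/inv/inv-core.yaml have already been loaded into a list of dictionaries (for example with a YAML parser), and the helper names find_cable and switch_behind_patch_panel are hypothetical. The back-side label is passed in explicitly instead of being derived, because the front-to-back mapping is not a plain suffix swap in every case.

```python
def find_cable(cables, end_a=None, end_b=None):
    """Return the first cable entry matching the given endpoints, or None."""
    for cable in cables:
        if end_a is not None and cable.get("endA") != end_a:
            continue
        if end_b is not None and cable.get("endB") != end_b:
            continue
        return cable
    return None


def switch_behind_patch_panel(cables, storage_port, panel_front, panel_back):
    """Follow storage port -> panel front -> panel back -> switch port.

    Returns the switch-side endpoint (endA of the back-side cable), or
    None if either cable entry is missing from the inventory.
    """
    if find_cable(cables, end_a=storage_port, end_b=panel_front) is None:
        return None
    back_cable = find_cable(cables, end_b=panel_back)
    return back_cable["endA"] if back_cable else None


# Entries reduced to the fields used here, taken from the example above.
cables = [
    {"endA": "xx-ab-stge01-01:e0g", "endB": "xx-ab-ppl01:r04Ap01BO-ft"},
    {"endA": "xx-ab-torsw02:Eth1/1", "endB": "xx-ab-ppl01:r04Ap01BO-bk",
     "notes": "25Gb breakout"},
]

switch_port = switch_behind_patch_panel(
    cables,
    "xx-ab-stge01-01:e0g",
    "xx-ab-ppl01:r04Ap01BO-ft",
    "xx-ab-ppl01:r04Ap01BO-bk",
)
observed_switch = "xx-ab-torsw02"  # switch named in the "Got:" part of the check
mapping_ok = switch_port is not None and switch_port.split(":")[0] == observed_switch
print(switch_port, mapping_ok)
```

The sketch compares switch hostnames only. Matching the inventory port name (Eth1/1) against the reported breakout port (Ethernet1/1/1) is left to the operator.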
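18.2.1.2. Reconcile error on the object storage site (malformed DNS suffix)
Issue:
The ObjectStorageSite custom resource is set to Ready: false, and its logs report Reconcile error, retrying: failed to parse location, found malformed DNSSuffix.
Workaround:
Ignore the error. These errors go away after the "root admin cluster bootstrap" step of the installation completes.
18.2.1.3. Incorrect bare metal settings for the root admin cluster
Example failure in the validation output: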
- passed: false
description: |-
BMM setting validation on server xx-yy-bm01 failed with error:
server has unexpected settings:
/redfish/v1/Systems/1/SecureBoot.SecureBootEnable is true, want false
target: xx-yy-bm01
targettype: ServerSettings
vendorerrorcode: SERVER_TEST_FAIL(0x04)
gpcerrorcode: FailedInBMMSetting
mitigation: Refer to the artifact to see which server flags. Check the connection
to the server iLO port. Check the account of iLO. Check if the iLO and server
are fully powered up. Check the concerned settings of server ah-ab-bm01.
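When several servers are flagged, it can help to extract the offending settings from the description text. The sketch below is a hedged illustration: it assumes each unexpected setting is reported on its own line in the form "PATH is VALUE, want VALUE", as in the example above, and the helper name unexpected_settings is hypothetical.

```python
import re

# Matches lines such as:
#   /redfish/v1/Systems/1/SecureBoot.SecureBootEnable is true, want false
SETTING_RE = re.compile(
    r"^\s*(?P<path>/redfish/\S+) is (?P<got>\S+), want (?P<want>\S+)\s*$",
    re.MULTILINE,
)


def unexpected_settings(description):
    """Return (path, actual, expected) tuples found in a description block."""
    return [
        (m["path"], m["got"], m["want"])
        for m in SETTING_RE.finditer(description)
    ]


description = """BMM setting validation on server xx-yy-bm01 failed with error:
server has unexpected settings:
/redfish/v1/Systems/1/SecureBoot.SecureBootEnable is true, want false"""

for path, got, want in unexpected_settings(description):
    print(f"{path}: got {got}, want {want}")
```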
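18.2.1.4. Patch panel mismatch
Issue:
The hardware check should target the device at the far end of the connection, not the directly connected device (xx-xx-ppl).
Example: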
- description: This check validates the storage network connection against the cell
configuration.
target: xx-yy-stge01-01:e0e<>xx-yy-torsw01 (aa:aa:aa:aa:aa:aa):Ethernet1/1/1
targettype: ""
checkresult:
passed: false
summary: Storage network connection mismatched.
detail: 'Got: xx-yy-stge01-01:e0e<>xx-yy-torsw01 (aa:aa:aa:aa:aa:aa):Ethernet1/1/1,
want: expected: xx-yy-stge01-01:e0e<>xx-yy-ppl01:r03Ap01BO-ft'
vendorerrorcode: ""
errorcode: VAL-E3026
mitigation: If this check fails, it can indicate that the Storage system is not
configurated according to the configuration file. Adjust the cabling so it matches
with the cell configuration.
Workaround:
Ignore these errors.
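To separate these ignorable failures from genuine cabling mismatches, the detail text can be checked for a patch panel on the expected side. The following is a minimal sketch under stated assumptions: patch panel hostnames are assumed to end in -ppl followed by digits, as in the examples in this section, and the helper name wants_patch_panel is hypothetical.

```python
import re

# Captures the "Got" link and the "want: expected" link from the detail text.
DETAIL_RE = re.compile(
    r"Got:\s*(?P<got>\S+<>\S+).*?want:\s*expected:\s*(?P<want>\S+<>\S+)",
    re.DOTALL,
)


def wants_patch_panel(detail):
    """Return True if the expected far end of the link is a patch panel."""
    match = DETAIL_RE.search(detail)
    if match is None:
        return False
    far_end = match.group("want").split("<>", 1)[1]
    host = far_end.split(":", 1)[0]
    # Assumption: patch panel hostnames look like xx-yy-ppl01.
    return re.search(r"-ppl\d+$", host) is not None


detail = (
    "Got: xx-yy-stge01-01:e0e<>xx-yy-torsw01 (aa:aa:aa:aa:aa:aa):Ethernet1/1/1,\n"
    "want: expected: xx-yy-stge01-01:e0e<>xx-yy-ppl01:r03Ap01BO-ft"
)
print(wants_patch_panel(detail))
```

A result of False means the expected endpoint is not a patch panel, so the failure should be investigated as a real mismatch.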
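18.2.1.5. Ping test failure
Issue:
This is expected CDP behavior: ARP flooding is required to populate the CAM table on the switch before the devices become reachable. The first 1 to 5 packets are likely to be dropped.
Example: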
- description: This check validates the link quality from the management switches
to other switches and baremetal node by measuring the packet delivery ratio of
100 ping requests.
target: xx-yy-mgmtsw01
targettype: ManagementSwitch
checkresult:
passed: false
summary: Link quality from ManagementSwitch to other devices is degraded.
detail: |-
Check the cable connections of management switch xx-yy-mgmtsw01.
Error:
ping test failed on link xx-yy-mgmtsw01:Eth1/52<>xx-yz-mgmtaggsw01:Eth1/1 with 1 packets dropped in 100 packets send
ping test failed on link xx-yy-mgmtsw01:Eth1/32<>xx-yy-aggsw01:mgmt0 with 1 packets dropped in 100 packets send
ping test failed on link xx-yy-mgmtsw01:Eth1/36<>xx-yy-mgmtaggsw01:mgmt0 with 1 packets dropped in 100 packets send
ping test failed on link xx-yy-mgmtsw01:Eth1/41<>xx-yy-torsw02:mgmt0 with 1 packets dropped in 100 packets send
ping test failed on link xx-yy-mgmtsw01:Eth1/42<>xx-yy-torsw01:mgmt0 with 1 packets dropped in 100 packets send
ping test failed on link xx-yy-mgmtsw01:Eth1/51<>xx-yy-mgmtaggsw01:Eth1/1 with 1 packets dropped in 100 packets send
ping test failed on link xx-yy-mgmtsw01:Eth1/45<>xx-yy-base02:ilo with 1 packets dropped in 100 packets send
ping test failed on link xx-yy-mgmtsw01:Eth1/46<>xx-yy-base03:ilo with 1 packets dropped in 100 packets send
ping test failed on link xx-yy-mgmtsw01:Eth1/24<>xx-yy-base03:LOM1 with 1 packets dropped in 100 packets send.
vendorerrorcode: SWITCH_TEST_FAIL(0x01)
errorcode: VAL-E1003
mitigation: If this check failed, it usually means the network cables from the
management switch need to be inspected or replaced. Check the artifacts directory
or stdout to see which cable flagged.
Workaround:
Ignore these errors.
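A small drop count matches the warm-up behavior described above, but a larger one may still point at a real cabling problem. The sketch below parses the detail text and lists only the links whose drop count exceeds a threshold. The threshold of 5 is taken from the "first 1 to 5 packets" statement, treating larger counts as suspicious is an assumption, and the helper name links_needing_inspection is hypothetical. The check also cannot tell whether the dropped packets were the first ones sent.

```python
import re

# Matches lines such as:
#   ping test failed on link A<>B with 1 packets dropped in 100 packets send
PING_RE = re.compile(
    r"ping test failed on link (?P<link>\S+) with (?P<dropped>\d+) "
    r"packets dropped in (?P<sent>\d+) packets send"
)


def links_needing_inspection(detail, warmup_drops=5):
    """Return (link, dropped, sent) for links with more drops than warm-up explains."""
    suspicious = []
    for match in PING_RE.finditer(detail):
        dropped = int(match.group("dropped"))
        if dropped > warmup_drops:
            suspicious.append(
                (match.group("link"), dropped, int(match.group("sent")))
            )
    return suspicious


detail = """Check the cable connections of management switch xx-yy-mgmtsw01.
Error:
ping test failed on link xx-yy-mgmtsw01:Eth1/41<>xx-yy-torsw02:mgmt0 with 1 packets dropped in 100 packets send
ping test failed on link xx-yy-mgmtsw01:Eth1/24<>xx-yy-base03:LOM1 with 1 packets dropped in 100 packets send."""

print(links_needing_inspection(detail))
```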
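18.2.1.6. ONTAP storage cluster name check
Issue:
The automation looks for the ONTAP device hostname, but the ONTAP device appears on the switch under its serial number.
Example: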
- description: This check validates the storage cluster name and management interface
are consistent between netapp ontap client and the cell configuration.
target: yy-stge-clus-01
targettype: StorageCluster
checkresult:
passed: false
summary: StorageCluster management interface cannot be found.
detail: StorageCluster management interface x.x.x.x in the cell configuration
cannot be found in the netapp ontap client.
vendorerrorcode: STORAGE_TEST_FAIL(0x03)
errorcode: VAL-E3007
mitigation: Review if management IPfor StorageCluster yy-stge-clus-01 in the cell
configuration is correct.
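Workaround:
Ignore these errors.
18.2.1.7. Bootstrapper LLDP discovery failure
Issue:
show lldp neighbors cannot find the bootstrapper from the TOR switch. This appears to happen because the operating system on the bootstrapper (Ubuntu) does not respond to LLDP requests.
Example: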
- description: This check validates the connection between TorSwitch and Server. The
connection is retriveved via "show lldp neighbors" and cross check with the MAC
address for NIC port from Server defined in the cell configuration.
target: xx-yy-torsw02
targettype: TORSwitch
checkresult:
passed: false
summary: Connection between TorSwitch and Server does not match with the cell
configuration.
detail: |-
Check the cable connection between TorSwitch and Server.
Error:
the BM server port xx-yy-bm15:s1p2 could not be found in the rack. Check if the server xx-yy-bm15 is powered up. If the server is powered up, check the cell.yaml file to see if the connection to switch port xx-yy-torsw02:Eth1/10/2 comply with the rack mount
vendorerrorcode: SWITCH_TEST_FAIL(0x01)
errorcode: VAL-E1001
mitigation: If this check failed, it usually means the connection from TorSwitch
to Server does not match the cell configuration. Or the Server has the wrong
MAC address for NIC port in the cell configuration. Check the artifacts directory
or stdout to see which connection flagged.
Workaround:
Use show mac address-table on the TOR switch to confirm that the connection from the TOR switch to the bootstrapper is established.