Cassandra pods not starting in the secondary region

Symptom

Cassandra pods fail to start in one of the regions of a multi-region Apigee hybrid setup. When the overrides.yaml file is applied, the Cassandra pods do not start successfully.

Error messages

You will observe the following error message in the Cassandra pod logs:

Exception (java.lang.RuntimeException) encountered during startup:
A node with address 10.52.18.40 already exists, cancelling join.
use cassandra.replace_address if you want to replace this node.

You may also see a warning in the Cassandra pod status.

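Possible causes

This issue is typically observed in the following scenario:

An Apigee runtime cluster is deleted in one of the regions.
An attempt is made to reinstall the Apigee runtime cluster in that region, with the Cassandra seed host configured in the overrides.yaml file as described in Multi-region deployment on GKE and GKE on-prem.

Deleting an Apigee runtime cluster does not remove its references from the Cassandra cluster, so stale references to the Cassandra pods of the deleted cluster are retained. As a result, when you try to reinstall the Apigee runtime cluster in the secondary region, the Cassandra pods report that certain IP addresses already exist. This happens because the new pods can be assigned IP addresses from the same subnet that was used before.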

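Cause	Description
Stale references to deleted secondary region pods in the Cassandra cluster	Deleting the Apigee runtime cluster in the secondary region does not remove the references to the IP addresses of the Cassandra pods in the secondary region.

Cause: Stale references to deleted secondary region pods in the Cassandra cluster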

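Diagnosis

The error message in the Cassandra pod log, A node with address 10.52.18.40 already exists, indicates that a stale reference exists to a secondary region Cassandra pod with the IP address 10.52.18.40. Validate this by running the nodetool status command in the primary region.

In this scenario, the output still lists the IP address 10.52.18.40 associated with the Cassandra pod of the secondary region.
If the output contains stale references to Cassandra pods in the secondary region, it indicates that the secondary region has been deleted but the IP addresses of its Cassandra pods were not removed.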
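
The check described above can be scripted. The following is a minimal sketch, not part of the official procedure: the helper name list_dc_nodes is hypothetical, and it assumes the usual nodetool status layout in which each datacenter section begins with a "Datacenter: NAME" line, node rows begin with a two-letter status code such as UN or DN, and the Host ID is the second-to-last column. It prints the address and host ID (UUID) of every node still listed under a given datacenter.

```shell
# list_dc_nodes DC_NAME
# Hypothetical helper: reads `nodetool status` output on stdin and prints
# "ADDRESS HOST_ID" for every node listed under the datacenter DC_NAME.
list_dc_nodes() {
  awk -v dc="$1" '
    # A "Datacenter: NAME" line opens a new per-datacenter section.
    $1 == "Datacenter:" { current = $2; next }
    # Node rows start with a status/state code (Up/Down + Normal/Leaving/Joining/Moving).
    current == dc && $1 ~ /^[UD][NLJM]$/ {
      # Address is the second column; Host ID is the second-to-last column.
      print $2, $(NF - 1)
    }
  '
}

# Example call (pod name and credential placeholders as used elsewhere on this page):
# kubectl exec -n apigee apigee-cassandra-default-0 -- \
#   nodetool -u admin_user -pw ADMIN.PASSWORD status | list_dc_nodes Secondary-DC2
```

Any line printed for the secondary datacenter after that cluster has been deleted is a stale reference, and the second field is the UUID that nodetool removenode expects.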

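Resolution

Perform the following steps to remove the stale references to the Cassandra pods of the deleted cluster:

Log in to the container and connect to the Cassandra command-line interface by following the steps in Create the client container.

After you have logged in to the container and connected to the Cassandra cqlsh interface, run the following CQL query to list the current keyspace definitions:

select * from system_schema.keyspaces;

Sample output showing the current keyspaces:

In the following output, Primary-DC1 refers to the primary region and Secondary-DC2 refers to the secondary region.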

bash-4.4# cqlsh 10.50.112.194 -u admin_user -p ADMIN.PASSWORD --ssl
Connected to apigeecluster at 10.50.112.194:9042.
[cqlsh 5.0.1 | Cassandra 3.11.6 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.

admin_user@cqlsh> Select * from system_schema.keyspaces;

keyspace_name                        | durable_writes | replication
-------------------------------------+----------------+--------------------------------------------------------------------------------------------------
system_auth                          |           True | {'Primary-DC1': '3', 'Secondary-DC2': '3', 'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy'}
kvm_tsg1_apigee_hybrid_prod_hybrid   |           True | {'Primary-DC1': '3', 'Secondary-DC2': '3', 'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy'}
kms_tsg1_apigee_hybrid_prod_hybrid   |           True | {'Primary-DC1': '3', 'Secondary-DC2': '3', 'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy'}
system_schema                        |           True |                                           {'class': 'org.apache.cassandra.locator.LocalStrategy'}
system_distributed                   |           True | {'Primary-DC1': '3', 'Secondary-DC2': '3', 'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy'}
system                               |           True |                                           {'class': 'org.apache.cassandra.locator.LocalStrategy'}
perses                               |           True | {'Primary-DC1': '3', 'Secondary-DC2': '3', 'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy'}
cache_tsg1_apigee_hybrid_prod_hybrid |           True | {'Primary-DC1': '3', 'Secondary-DC2': '3', 'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy'}
rtc_tsg1_apigee_hybrid_prod_hybrid   |           True | {'Primary-DC1': '3', 'Secondary-DC2': '3', 'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy'}
quota_tsg1_apigee_hybrid_prod_hybrid |           True | {'Primary-DC1': '3', 'Secondary-DC2': '3', 'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy'}
system_traces                        |           True | {'Primary-DC1': '3', 'Secondary-DC2': '3', 'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy'}
(11 rows)

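As you can see, the keyspaces still refer to both Primary-DC1 and Secondary-DC2, even though the Apigee runtime cluster in the secondary region was deleted.

The stale references to Secondary-DC2 must be deleted from each keyspace definition.

Before you delete the stale references from the keyspace definitions, use the following command to delete the entire Apigee hybrid installation from Secondary-DC2, except ASM (Istio) and cert-manager. For more information, see Uninstall hybrid runtime.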
```
helm uninstall -n APIGEE_NAMESPACE ENV_GROUP_RELEASE_NAME ENV_RELEASE_NAME $ORG_NAME ingress-manager telemetry redis datastore
```
Also uninstall apigee-operator:
```
helm uninstall -n APIGEE_NAMESPACE operator
```

Remove the stale references to Secondary-DC2 from each keyspace by altering the keyspace definition.

ALTER KEYSPACE system_auth WITH replication = {'Primary-DC1': '3', 'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy'};
ALTER KEYSPACE kvm_ORG_NAME_apigee_hybrid_prod_hybrid WITH replication = {'Primary-DC1': '3', 'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy'};
ALTER KEYSPACE kms_ORG_NAME_apigee_hybrid_prod_hybrid WITH replication = {'Primary-DC1': '3', 'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy'};
ALTER KEYSPACE system_distributed WITH replication = {'Primary-DC1': '3', 'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy'};
ALTER KEYSPACE perses WITH replication = {'Primary-DC1': '3', 'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy'};
ALTER KEYSPACE cache_ORG_NAME_apigee_hybrid_ENV_NAME_hybrid WITH replication = {'Primary-DC1': '3', 'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy'};
ALTER KEYSPACE rtc_ORG_NAME_apigee_hybrid_ENV_NAME_hybrid WITH replication = {'Primary-DC1': '3', 'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy'};
ALTER KEYSPACE quota_ORG_NAME_apigee_hybrid_ENV_NAME_hybrid WITH replication = {'Primary-DC1': '3', 'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy'};
ALTER KEYSPACE system_traces WITH replication = {'Primary-DC1': '3', 'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy'};
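
Because every statement above has the same shape, the list can be generated rather than typed by hand. The following is a sketch only: the function name gen_alter_keyspaces is hypothetical, and the keyspace names you pass in should be the ones returned by the system_schema.keyspaces query in your own cluster (excluding the LocalStrategy keyspaces system and system_schema).

```shell
# gen_alter_keyspaces DC_NAME REPLICATION_FACTOR KEYSPACE...
# Hypothetical helper: prints one ALTER KEYSPACE statement per keyspace,
# keeping only DC_NAME in the NetworkTopologyStrategy replication map.
gen_alter_keyspaces() {
  dc_name="$1"
  rf="$2"
  shift 2
  for ks in "$@"; do
    printf "ALTER KEYSPACE %s WITH replication = {'%s': '%s', 'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy'};\n" \
      "$ks" "$dc_name" "$rf"
  done
}

# Example call:
# gen_alter_keyspaces Primary-DC1 3 system_auth system_distributed system_traces perses
```

The function only prints text. Review the generated statements before running them in the cqlsh session.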

Verify that the stale references to the Secondary-DC2 region have been removed from all keyspaces by running the following command:
```
select * from system_schema.keyspaces;
```
Log in to a Cassandra pod of Primary-DC1 and remove the references to the UUIDs of all the Cassandra pods of Secondary-DC2. The UUIDs can be obtained from the nodetool status command, as described earlier in Diagnosis.
```
kubectl exec -it -n apigee apigee-cassandra-default-0 -- bash
nodetool -u admin_user -pw ADMIN.PASSWORD removenode UUID_OF_CASSANDRA_POD_IN_SECONDARY_DC2
```
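Verify that no Cassandra pods of Secondary-DC2 are present by running the nodetool status command again.
Install the Apigee runtime cluster in the secondary region (Secondary-DC2) by following the steps in Multi-region deployment on GKE and GKE on-prem.

Must gather diagnostic information

If the problem persists after you follow the instructions above, gather the following diagnostic information and then contact Google Cloud Customer Care:

The Google Cloud project ID
The name of the Apigee hybrid organization
The overrides.yaml files from both the primary and secondary regions, with any sensitive information masked
The Kubernetes pod status in all namespaces of both the primary and secondary regions: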
```
kubectl get pods -A > kubectl-pod-status`date +%Y.%m.%d_%H.%M.%S`.txt
```

A Kubernetes cluster-info dump from both the primary and secondary regions:

# generate kubernetes cluster-info dump
kubectl cluster-info dump -A --output-directory=/tmp/kubectl-cluster-info-dump

# zip kubernetes cluster-info dump
zip -r kubectl-cluster-info-dump`date +%Y.%m.%d_%H.%M.%S`.zip /tmp/kubectl-cluster-info-dump/*

The output of the following nodetool commands from the primary region:

export u=`kubectl -n apigee get secrets apigee-datastore-default-creds -o jsonpath='{.data.jmx\.user}' | base64 -d`
export pw=`kubectl -n apigee get secrets apigee-datastore-default-creds -o jsonpath='{.data.jmx\.password}' | base64 -d`

kubectl -n apigee exec -it apigee-cassandra-default-0 -- nodetool -u $u -pw $pw info 2>&1 | tee /tmp/k_nodetool_info_$(date +%Y.%m.%d_%H.%M.%S).txt
kubectl -n apigee exec -it apigee-cassandra-default-0 -- nodetool -u $u -pw $pw describecluster 2>&1 | tee /tmp/k_nodetool_describecluster_$(date +%Y.%m.%d_%H.%M.%S).txt
kubectl -n apigee exec -it apigee-cassandra-default-0 -- nodetool -u $u -pw $pw failuredetector 2>&1 | tee /tmp/k_nodetool_failuredetector_$(date +%Y.%m.%d_%H.%M.%S).txt
kubectl -n apigee exec -it apigee-cassandra-default-0 -- nodetool -u $u -pw $pw status 2>&1 | tee /tmp/k_nodetool_status_$(date +%Y.%m.%d_%H.%M.%S).txt
kubectl -n apigee exec -it apigee-cassandra-default-0 -- nodetool -u $u -pw $pw gossipinfo 2>&1 | tee /tmp/k_nodetool_gossipinfo_$(date +%Y.%m.%d_%H.%M.%S).txt
kubectl -n apigee exec -it apigee-cassandra-default-0 -- nodetool -u $u -pw $pw netstats 2>&1 | tee /tmp/k_nodetool_netstats_$(date +%Y.%m.%d_%H.%M.%S).txt
kubectl -n apigee exec -it apigee-cassandra-default-0 -- nodetool -u $u -pw $pw proxyhistograms 2>&1 | tee /tmp/k_nodetool_proxyhistograms_$(date +%Y.%m.%d_%H.%M.%S).txt
kubectl -n apigee exec -it apigee-cassandra-default-0 -- nodetool -u $u -pw $pw tpstats 2>&1 | tee /tmp/k_nodetool_tpstats_$(date +%Y.%m.%d_%H.%M.%S).txt
kubectl -n apigee exec -it apigee-cassandra-default-0 -- nodetool -u $u -pw $pw gcstats 2>&1 | tee /tmp/k_nodetool_gcstats_$(date +%Y.%m.%d_%H.%M.%S).txt
kubectl -n apigee exec -it apigee-cassandra-default-0 -- nodetool -u $u -pw $pw version 2>&1 | tee /tmp/k_nodetool_version_$(date +%Y.%m.%d_%H.%M.%S).txt
kubectl -n apigee exec -it apigee-cassandra-default-0 -- nodetool -u $u -pw $pw ring 2>&1 | tee /tmp/k_nodetool_ring_$(date +%Y.%m.%d_%H.%M.%S).txt
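
The eleven commands above differ only in the nodetool subcommand, so they can also be run from a loop. The following is a sketch under a few assumptions: the function name collect_nodetool_output is hypothetical, it uses one shared timestamp for all files instead of one per command, and it omits the -it flags because these nodetool subcommands need no interactive input.

```shell
# collect_nodetool_output OUT_DIR JMX_USER JMX_PASSWORD
# Hypothetical helper: runs each nodetool subcommand listed above against
# apigee-cassandra-default-0 and saves the output to one file per subcommand.
collect_nodetool_output() {
  out_dir="$1"
  jmx_user="$2"
  jmx_pw="$3"
  ts=$(date +%Y.%m.%d_%H.%M.%S)   # one timestamp shared by all output files
  mkdir -p "$out_dir" || return 1
  for sub in info describecluster failuredetector status gossipinfo \
             netstats proxyhistograms tpstats gcstats version ring; do
    kubectl -n apigee exec apigee-cassandra-default-0 -- \
      nodetool -u "$jmx_user" -pw "$jmx_pw" "$sub" 2>&1 \
      | tee "$out_dir/k_nodetool_${sub}_${ts}.txt"
  done
}

# Example call, reusing the u and pw variables exported above:
# collect_nodetool_output /tmp "$u" "$pw"
```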