![AWS pcluster fails with MasterServerWaitCondition receiving a FAILURE signal, plus iptables and Chef version errors](https://rvso.com/image/768939/AWS%20pcluster%20%E5%A4%B1%E6%95%97%EF%BC%8C%E4%B8%A6%E9%A1%AF%E7%A4%BA%20MasterServerWaitCondition%20%E6%94%B6%E5%88%B0%20FAILURE%20%E8%A8%8A%E8%99%9F%E3%80%81iptables%20%E5%92%8C%20Chef%20%E7%89%88%E6%9C%AC%E9%8C%AF%E8%AA%A4.png)
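I'm trying to build an AMI for ParallelCluster. I started from Amazon's stock AMI (ami-0436692c7b452bae4, the alinux image for us-west-2, which is my region) and modified it slightly by adding a few packages.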
However, when I run `pcluster create foo --norollback`, I get this error:
Beginning cluster creation for cluster: stockAWS
Creating stack named: parallelcluster-stockAWS
Status: parallelcluster-stockAWS - ROLLBACK_IN_PROGRESS
Cluster creation failed. Failed events:
- AWS::AutoScaling::AutoScalingGroup ComputeFleet Resource creation cancelled
- AWS::CloudFormation::WaitCondition MasterServerWaitCondition Received FAILURE signal with UniqueId i-booyaa
I then ran `ssh foo` and looked at the logs. `/var/log/cfncluster-init.log` contains a long error log; here is the bottom of it:
2021-07-28 23:16:49,659 [ERROR] Command chef (chef-client --local-mode --config /etc/chef/client.rb --log_level auto --force-formatter --no-color --chef-zero-port 8889 --json-attributes /etc/chef/dna.json --override-runlist aws-parallelcluster::_prep_env) failed
2021-07-28 23:16:49,659 [DEBUG] Command chef output: Starting Chef Client, version 14.2.0
[2021-07-28T23:16:47+00:00] WARN: Run List override has been provided.
[2021-07-28T23:16:47+00:00] WARN: Run List override has been provided.
[2021-07-28T23:16:47+00:00] WARN: Original Run List: [recipe[aws-parallelcluster::slurm_config]]
[2021-07-28T23:16:47+00:00] WARN: Original Run List: [recipe[aws-parallelcluster::slurm_config]]
[2021-07-28T23:16:47+00:00] WARN: Overridden Run List: [recipe[aws-parallelcluster::_prep_env]]
[2021-07-28T23:16:47+00:00] WARN: Overridden Run List: [recipe[aws-parallelcluster::_prep_env]]
resolving cookbooks for run list: ["aws-parallelcluster::_prep_env"]
Synchronizing Cookbooks:
- aws-parallelcluster (2.5.1)
- poise-python (1.7.0)
- tar (2.1.1)
- selinux (2.1.1)
- nfs (2.6.4)
- yum (5.1.0)
- yum-epel (3.1.0)
- openssh (2.6.3)
- apt (7.0.0)
- hostname (0.4.2)
- line (2.4.1)
- ulimit (1.0.0)
- pyenv (3.1.1)
- kernel_module (1.1.2)
- poise (2.8.2)
- poise-languages (2.1.2)
- iptables (8.0.0)
- hostsfile (3.0.1)
- poise-archive (1.5.0)
Running handlers:
[2021-07-28T23:16:49+00:00] ERROR: Running exception handlers
[2021-07-28T23:16:49+00:00] ERROR: Running exception handlers
Running handlers complete
[2021-07-28T23:16:49+00:00] ERROR: Exception handlers complete
[2021-07-28T23:16:49+00:00] ERROR: Exception handlers complete
Chef Client failed. 0 resources updated in 11 seconds
[2021-07-28T23:16:49+00:00] FATAL: Stacktrace dumped to /etc/chef/local-mode-cache/cache/chef-stacktrace.out
[2021-07-28T23:16:49+00:00] FATAL: Stacktrace dumped to /etc/chef/local-mode-cache/cache/chef-stacktrace.out
[2021-07-28T23:16:49+00:00] FATAL: Please provide the contents of the stacktrace.out file if you file a bug report
[2021-07-28T23:16:49+00:00] FATAL: Please provide the contents of the stacktrace.out file if you file a bug report
[2021-07-28T23:16:49+00:00] FATAL: Chef::Exceptions::CookbookChefVersionMismatch: Cookbook 'iptables' version '8.0.0' depends on chef version [">= 15.3"], but the running chef version is 14.2.0
[2021-07-28T23:16:49+00:00] FATAL: Chef::Exceptions::CookbookChefVersionMismatch: Cookbook 'iptables' version '8.0.0' depends on chef version [">= 15.3"], but the running chef version is 14.2.0
2021-07-28 23:16:49,659 [ERROR] Error encountered during build of chefPrepEnv: Command chef failed
Traceback (most recent call last):
File "/usr/lib/python3.7/site-packages/cfnbootstrap/construction.py", line 573, in run_config
CloudFormationCarpenter(config, self._auth_config).build(worklog)
File "/usr/lib/python3.7/site-packages/cfnbootstrap/construction.py", line 273, in build
self._config.commands)
File "/usr/lib/python3.7/site-packages/cfnbootstrap/command_tool.py", line 127, in apply
raise ToolError(u"Command %s failed" % name)
cfnbootstrap.construction_errors.ToolError: Command chef failed
2021-07-28 23:16:49,661 [ERROR] -----------------------BUILD FAILED!------------------------
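Because every Chef `ERROR`/`FATAL` line appears twice in the excerpt above, the root cause is easy to miss in a long log. The following is a minimal sketch (the function name `chef_errors` is hypothetical, not part of ParallelCluster or cfnbootstrap) that keeps only Chef-style `ERROR`/`FATAL` lines and drops repeats, keyed on the message text with the timestamp ignored. It assumes the `[timestamp] LEVEL: message` layout seen in this log.

```python
import re
from typing import Iterable, List

# Matches Chef-style lines such as
# "[2021-07-28T23:16:49+00:00] FATAL: Chef::Exceptions::..."
# capturing the level and the message that follows the bracketed timestamp.
CHEF_LINE = re.compile(r"^\[[^\]]+\]\s+(ERROR|FATAL):\s+(.*)$")


def chef_errors(lines: Iterable[str]) -> List[str]:
    """Return Chef ERROR/FATAL messages in order, dropping repeated messages.

    Lines in other formats (e.g. cfnbootstrap's "2021-... [ERROR] ...")
    and other levels (WARN, INFO) are skipped.
    """
    seen = set()
    result = []
    for line in lines:
        match = CHEF_LINE.match(line.strip())
        if not match:
            continue
        msg = f"{match.group(1)}: {match.group(2)}"
        if msg not in seen:  # ignore the timestamp when deduplicating
            seen.add(msg)
            result.append(msg)
    return result
```

For example, on the head node: `with open("/var/log/cfncluster-init.log") as f: print(*chef_errors(f), sep="\n")`. On this log the last line printed would be the `CookbookChefVersionMismatch` message.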
If I run `iptables --version`, I get v1.8.4, and the same happens with `sudo`. Chef is 14.2.0.
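Note that the FATAL line compares two different things. `'iptables' version '8.0.0'` is the version of the `iptables` Chef cookbook listed under "Synchronizing Cookbooks", not of the iptables binary (v1.8.4). The `[">= 15.3"]` requirement applies to the Chef client, which is 14.2.0. The sketch below reproduces that comparison in miniature. It is a simplified illustration with hypothetical helper names: it handles only a single `>=` constraint on purely numeric dotted versions, whereas Chef's own constraint handling supports more operators.

```python
from typing import Tuple


def parse_version(v: str) -> Tuple[int, ...]:
    """Turn a dotted numeric version like '14.2.0' into a comparable tuple."""
    return tuple(int(part) for part in v.split("."))


def satisfies_ge(running: str, minimum: str) -> bool:
    """True if `running` >= `minimum`.

    The shorter tuple is padded with zeros so that '15.3' compares
    like '15.3.0'.
    """
    a, b = parse_version(running), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b
```

With the values from the log, `satisfies_ge("14.2.0", "15.3")` is `False`, which corresponds to the mismatch Chef reports for the cookbook.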
Frustratingly, I get exactly the same behavior if I create the ParallelCluster stack with the stock AWS AMI. What's going on here?