Error log:
Epoch: [229] Total time: 0:17:21
Test: [ 0/49] eta: 0:05:00 loss: 1.7994 (1.7994) acc1: 78.0822 (78.0822) acc5: 95.2055 (95.2055) time: 6.1368 data: 5.9411 max mem: 10624
WARNING:torch.distributed.elastic.agent.server.api:Received 1 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 44348 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 44349 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 44350 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 44351 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 44352 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 44353 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 44354 closing signal SIGHUP
Traceback (most recent call last):
  File "/home/biometrics/miniconda3/envs/torch/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.12.0.dev20220502', 'console_scripts', 'torchrun')())
  File "/home/biometrics/miniconda3/envs/torch/lib/python3.7/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/biometrics/miniconda3/envs/torch/lib/python3.7/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/home/biometrics/miniconda3/envs/torch/lib/python3.7/site-packages/torch/distributed/run.py", line 755, in run
    )(*cmd_args)
  File "/home/biometrics/miniconda3/envs/torch/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/biometrics/miniconda3/envs/torch/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 236, in launch_agent
    result = agent.run()
  File "/home/biometrics/miniconda3/envs/torch/lib/python3.7/site-packages/torch/distributed/elastic/metrics/api.py", line 125, in wrapper
    result = f(*args, **kwargs)
  File "/home/biometrics/miniconda3/envs/torch/lib/python3.7/site-packages/torch/distributed/elastic/agent/server/api.py", line 709, in run
    result = self._invoke_run(role)
  File "/home/biometrics/miniconda3/envs/torch/lib/python3.7/site-packages/torch/distributed/elastic/agent/server/api.py", line 850, in _invoke_run
    time.sleep(monitor_interval)
  File "/home/biometrics/miniconda3/envs/torch/lib/python3.7/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 60, in _terminate_process_handler
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 44343 got signal: 1
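The number in "got signal: 1" can be mapped back to a signal name with Python's standard `signal` module. The sketch below assumes a POSIX system such as Linux, where signal 1 is SIGHUP, which matches the "closing signal SIGHUP" warnings in the log. SIGHUP is commonly delivered when the controlling terminal goes away, for example when an SSH session is closed. The helper name `describe_signal` is hypothetical and only for illustration.

```python
import signal


def describe_signal(signum: int) -> str:
    """Return the symbolic name of a signal number on the current platform."""
    # signal.Signals is an enum of the signals defined for this OS;
    # looking up an undefined number raises ValueError.
    return signal.Signals(signum).name


if __name__ == "__main__":
    # The SignalException in the log reports "got signal: 1".
    print(describe_signal(1))
```

On Linux this prints `SIGHUP`, confirming that the torchrun launcher (process 44343) was not crashing by itself but was told to hang up from outside.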
The solution found online is: