**Symptom:**
1. SSH connections to the cluster start failing with `ssh: connect to host cluster.pik-potsdam.de port 22: Connection refused`
2. Attempts to log in to the cluster via SSH, or to start a new process or task on a login node when already logged in, fail with the error `fork: retry: No child processes`
**Problem:**
The per-user process limit on the cluster has been reached. This is often caused by leftover `vscode-server` processes (from the VSCode Remote SSH extension) or `ssh-agent` processes that are not cleaned up on logout.
The error is not limited to these processes: any runaway self-spawning or recursive process can trigger it.
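One way to judge how close you are to the limit is to compare your current process count with the per-user limit. The sketch below is only an illustration and not part of the cluster tooling: the function name `near_limit` and the default threshold of 80 % are assumptions, and the function merely compares two numbers that you pass in.

```shell
# near_limit COUNT LIMIT [PERCENT]
# Succeeds (exit status 0) when COUNT is at or above PERCENT % of LIMIT.
# The name and the default of 80 % are illustrative assumptions.
near_limit() {
    count=$1
    limit=$2
    percent=${3:-80}
    # "ulimit -u" prints "unlimited" when no limit is set.
    [ "$limit" = "unlimited" ] && return 1
    [ $((count * 100)) -ge $((limit * percent)) ]
}

# Example for an interactive Bash session on a login node
# (on Linux, threads count toward the limit too, hence -L):
#   near_limit "$(ps -L -u "$USER" -o pid= | wc -l)" "$(ulimit -u)" \
#       && echo "Warning: close to the process limit"
```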
**Prevention:**
a) If you use the VSCode Remote SSH extension, do not use `cluster.pik-potsdam.de` as the hostname. Instead, pick one of `login01.pik-potsdam.de` or `login02.pik-potsdam.de`. When ending a session, press Ctrl + Shift + P, select “Kill VS Code Server on Host…”, and choose the same host.
This is because “cluster” is an alias for one of the two login nodes, chosen according to which node has the lowest current load. Repeated connections to “cluster” via the Remote SSH extension can therefore land on a different node and fail to reuse existing vscode-server processes, which then accumulate over time.
b) Set up a Bash function in your .bashrc and run it from time to time:
`stop-vscode ()`<br>
`{`<br>
`ps -o pid,command -u "$USER" | grep '[v]scode-server' | awk '{ print $1 }' | xargs -r kill -9`<br>
`}`<br>
c) If you use ssh-agent in your login script, be sure to kill the agent process when you log out. One way to do this is to run `ssh-agent -k`, which kills the agent identified by the `SSH_AGENT_PID` environment variable.
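A minimal sketch of such a cleanup for a login script follows. The function name `start_agent_with_cleanup` is hypothetical, and the sketch assumes that a new agent should only be started when none is already reachable through `SSH_AUTH_SOCK`.

```shell
start_agent_with_cleanup() {
    # Reuse a forwarded or already running agent instead of starting another.
    [ -n "${SSH_AUTH_SOCK:-}" ] && return 0
    command -v ssh-agent > /dev/null || return 1
    # "ssh-agent -s" prints Bourne-shell commands that set
    # SSH_AUTH_SOCK and SSH_AGENT_PID for this shell.
    eval "$(ssh-agent -s)" > /dev/null
    # Kill the agent named by SSH_AGENT_PID when this shell exits.
    trap 'ssh-agent -k > /dev/null 2>&1' EXIT
}
```

Calling `start_agent_with_cleanup` once from the login script would be enough. Note that the `trap` replaces any `EXIT` trap already set in that shell.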
**Fixes/Clearing out existing processes:**
- If you can still connect to the cluster, you can run `ps -u <username>` on both login nodes to find out which processes are causing the problem. You can then manually kill these processes.
- If you are unable to connect, ask another user or a cluster administrator to run this command for you.
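To see at a glance which programs account for most of your processes, the `ps` output can be grouped by command name. This is a minimal sketch with a hypothetical helper name, `count_by_command`; it reads one command name per line from standard input.

```shell
# Count how often each input line (here: a command name) occurs,
# most frequent first. The function name is an illustrative assumption.
count_by_command() {
    sort | uniq -c | sort -rn
}

# Example on a login node:
#   ps -u "$USER" -o comm= | count_by_command | head
```

A large count for a single name points to the processes worth inspecting before you kill anything.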
If VS-Code processes are indeed the problem:
**Stop all leftover VS-Code processes by running:**
`ssh <username>@login01.pik-potsdam.de "killall -u <username> vscode-server"`
Replace `login01` with `login02` if that is the affected node, and `<username>` with your cluster username.
Note: This command kills all of your processes named `vscode-server` on that login node, including those of sessions that are still in use. Use with care.
For vscode-server processes, an alternative method is to delete the .vscode-server subdirectory in your home directory on the cluster. If you can mount your home directory, you can do this:
Linux:<br>
`sshfs username@cluster.pik-potsdam.de:/home/username mountpoint`<br>
`cd mountpoint`<br>
`rm -R .vscode-server`<br>
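Because `rm -R` is unforgiving when run in the wrong directory, the Linux steps above could be wrapped in a small guard. This is a sketch under assumptions: the function name `remove_vscode_server_dir` is hypothetical, and it expects the mount point of your cluster home directory as its only argument.

```shell
# remove_vscode_server_dir MOUNTPOINT
# Deletes MOUNTPOINT/.vscode-server, but only if that directory exists.
remove_vscode_server_dir() {
    [ -n "${1:-}" ] || { echo "usage: remove_vscode_server_dir MOUNTPOINT" >&2; return 2; }
    dir="$1/.vscode-server"
    [ -d "$dir" ] || { echo "nothing to remove at $dir" >&2; return 1; }
    rm -R -- "$dir"
}

# Example, after mounting with sshfs as shown above:
#   remove_vscode_server_dir mountpoint
#   fusermount -u mountpoint   # unmount the sshfs mount again
```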
Windows:<br>
Press Win + R.<br>
Type `\\home\user` to open your home directory in Windows Explorer.<br>
Delete the `.vscode-server` directory.<br>