Import fails (and seems to restart)
The weird pulsating RAM usage might actually be an import task that keeps restarting: http://srv-mcc-apsis-n:8071/monitorix
Queue dashboard: http://10.10.13.45:8085/
Excerpt from log
Jul 18 11:38:03 srv-mcc-apsis-n dramatiq[984499]: [2024-07-18 11:38:03,941] [PID 984500] [Thread-3] [nacsos_data.util.deduplicate.index] [INFO] Constructing nearest neighbour lookup...
Jul 18 11:45:01 srv-mcc-apsis-n dramatiq[984499]: [2024-07-18 11:45:01,738] [PID 984499] [MainThread] [dramatiq.MainProcess] [CRITICAL] Worker with PID 984500 exited unexpectedly (code -9). Shutting down...
Jul 18 11:45:01 srv-mcc-apsis-n dramatiq[984499]: [2024-07-18 11:45:01,739] [PID 984626] [MainThread] [dramatiq.ForkProcess(0)] [INFO] Stopping fork process...
Jul 18 11:45:01 srv-mcc-apsis-n dramatiq[984499]: [2024-07-18 11:45:01,739] [PID 984501] [MainThread] [dramatiq.WorkerProcess(1)] [INFO] Stopping worker process...
Jul 18 11:45:01 srv-mcc-apsis-n dramatiq[984499]: [2024-07-18 11:45:01,739] [PID 984502] [MainThread] [dramatiq.WorkerProcess(2)] [INFO] Stopping worker process...
Jul 18 11:45:02 srv-mcc-apsis-n dramatiq[984499]: [2024-07-18 11:45:02,729] [PID 984501] [MainThread] [dramatiq.worker.Worker] [INFO] Shutting down...
Jul 18 11:45:02 srv-mcc-apsis-n dramatiq[984499]: [2024-07-18 11:45:02,731] [PID 984502] [MainThread] [dramatiq.worker.Worker] [INFO] Shutting down...
Jul 18 11:45:03 srv-mcc-apsis-n dramatiq[984499]: [2024-07-18 11:45:03,444] [PID 984501] [MainThread] [dramatiq.middleware.asyncio.AsyncIO] [INFO] Stopping event loop...
Jul 18 11:45:03 srv-mcc-apsis-n dramatiq[984499]: [2024-07-18 11:45:03,451] [PID 984501] [MainThread] [dramatiq.worker.Worker] [INFO] Worker has been shut down.
Jul 18 11:45:03 srv-mcc-apsis-n dramatiq[984499]: [2024-07-18 11:45:03,598] [PID 984502] [MainThread] [dramatiq.middleware.asyncio.AsyncIO] [INFO] Stopping event loop...
Jul 18 11:45:03 srv-mcc-apsis-n dramatiq[984499]: [2024-07-18 11:45:03,600] [PID 984502] [MainThread] [dramatiq.worker.Worker] [INFO] Worker has been shut down.
Jul 18 11:45:04 srv-mcc-apsis-n systemd[1]: nacsos-dramatiq.service: Main process exited, code=exited, status=1/FAILURE
Jul 18 11:45:04 srv-mcc-apsis-n systemd[1]: nacsos-dramatiq.service: Failed with result 'exit-code'.
Jul 18 11:45:04 srv-mcc-apsis-n systemd[1]: nacsos-dramatiq.service: Consumed 1h 56min 5.666s CPU time.
Jul 18 11:45:04 srv-mcc-apsis-n systemd[1]: nacsos-dramatiq.service: Scheduled restart job, restart counter is at 72.
Jul 18 11:45:04 srv-mcc-apsis-n systemd[1]: Stopped dramatiq workers.
Jul 18 11:45:04 srv-mcc-apsis-n systemd[1]: nacsos-dramatiq.service: Consumed 1h 56min 5.666s CPU time.
Jul 18 11:45:04 srv-mcc-apsis-n systemd[1]: Started dramatiq workers.
Other logs via
sudo journalctl -eu nacsos-dramatiq
sudo journalctl -eu nacsos-core
sudo journalctl -eu drama-dash
The main issue seems to be RAM running out while building the index, so we need some experimentation to verify that this is really the cause and, if so, to reduce the footprint. The nearest neighbour index currently in use is pynndescent: https://github.com/lmcinnes/pynndescent
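One cheap check before any experimentation: the `code -9` in the log excerpt above already points in this direction. Assuming dramatiq reports worker exit codes the way Python's multiprocessing does (a negative value -N means "terminated by signal N"), the worker was killed by signal 9. The helper below is only an illustration and is not part of nacsos or dramatiq.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Translate a process exit code into a human-readable description.

    Follows the multiprocessing convention: a negative code -N means the
    process was terminated by signal N.
    """
    if code < 0:
        try:
            name = signal.Signals(-code).name
        except ValueError:
            # Signal number not known on this platform
            name = f"signal {-code}"
        return f"killed by {name}"
    if code == 0:
        return "exited normally"
    return f"exited with status {code}"


if __name__ == "__main__":
    # Exit code taken from the dramatiq log line above
    print(describe_exit_code(-9))
```

On Linux, signal 9 is SIGKILL, which is what the kernel's out-of-memory killer sends. If that is what happened, the kernel log (for example `sudo journalctl -k`) should contain a matching out-of-memory entry at around 11:45:01.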
There are a bunch of alternatives, see https://github.com/erikbern/ann-benchmarks for a comparison. I used to be a big fan of HNSW. It's fast and efficient, but it has this annoying property of failing when you insert duplicates (which defeats the purpose here).
I don't have time to fix the issue of dealing with massive projects myself. Happy to guide someone and review code, but the research (which index to use, and testing it) has to be done by someone else.
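For whoever picks up the testing, a minimal sketch of how candidate indexes could be compared on memory is below. It uses only the standard library; `build_stand_in_index` is a hypothetical placeholder for the real index construction (pynndescent or an alternative), not code from nacsos_data.

```python
import tracemalloc


def peak_memory_mib(build, *args, **kwargs):
    """Run build(*args, **kwargs) and return (result, peak traced memory in MiB)."""
    tracemalloc.start()
    try:
        result = build(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak / 2**20


def build_stand_in_index(n_items, dim=16):
    # Hypothetical stand-in: replace with the actual index construction
    return [[float(i)] * dim for i in range(n_items)]


if __name__ == "__main__":
    for n in (1_000, 10_000, 100_000):
        _, peak = peak_memory_mib(build_stand_in_index, n)
        print(f"{n:>7} items: peak {peak:.1f} MiB")
```

Running the build at a few increasing sizes gives a rough idea of how memory scales before trying a full project. Note that tracemalloc only sees allocations made through Python's allocator, so memory allocated directly by native code may not show up; watching the process RSS during the run is a useful cross-check.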
It seems to fail midway through building the deduplication index: the worker dies, all other workers get killed, the service restarts, and apparently the task is started again. It looks like the default is to retry up to 20 times: https://dramatiq.io/guide.html#message-retries
I think this is where the parameter can be changed, so I set it to 0: https://gitlab.pik-potsdam.de/mcc-apsis/nacsos/nacsos-core/-/blob/main/server/pipelines/actor.py?ref_type=heads#L70