Import fails (and seems to restart)


One solution might be to store a shared vocabulary somewhere (for the count vectoriser) and store all vectors in the database, e.g. via https://github.com/pgvector/pgvector. This way, we wouldn't have to build an approximate NN index each time, we could run multiple imports at the same time, and the import process as a whole would be a bit simpler.
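
To illustrate why a shared vocabulary matters, here is a minimal sketch in plain Python. The vocabulary contents, the `vectorise` helper, and the whitespace tokenisation are all assumptions made for illustration, not the platform's actual vectoriser. The point is only that every import job maps tokens to the same column indices, so vectors produced by separate imports stay comparable.

```python
from collections import Counter

# Hypothetical shared vocabulary: a fixed token -> column index mapping that
# every import job would load instead of fitting its own.
SHARED_VOCABULARY = {"climate": 0, "policy": 1, "carbon": 2, "tax": 3}


def vectorise(text, vocabulary):
    """Return a count vector for `text` using a fixed vocabulary.

    Tokens outside the vocabulary are ignored, so the vector length
    never changes between imports.
    """
    counts = Counter(text.lower().split())
    vector = [0] * len(vocabulary)
    for token, index in vocabulary.items():
        vector[index] = counts[token]
    return vector


# Two independent imports produce vectors in the same space.
vec_a = vectorise("Carbon tax policy and carbon markets", SHARED_VOCABULARY)
vec_b = vectorise("Climate policy", SHARED_VOCABULARY)
```

With a per-import vocabulary, column 2 could mean "carbon" in one job and something else in the next, which is what would prevent storing all vectors in one table.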

We'd need to investigate the limits of pgvector (esp. dimensionality), compare alternatives, and understand what the storage requirements and lookup speeds are. Also important: can it do Jaccard similarity?
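
For reference when evaluating candidates, this is roughly what we would need the store to compute. It is a plain-Python sketch with illustrative function names: the first function is set-based Jaccard similarity on tokens, the second is the weighted variant (sometimes called Ruzicka similarity) that works directly on count vectors. Whether a given vector store offers either is exactly the open question.

```python
def jaccard_similarity(tokens_a, tokens_b):
    """Jaccard similarity |A & B| / |A | B| of two token collections."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(set_a & set_b) / len(union)


def weighted_jaccard(counts_a, counts_b):
    """Weighted Jaccard: sum of element-wise minima over sum of maxima."""
    if len(counts_a) != len(counts_b):
        raise ValueError("count vectors must have the same length")
    denominator = sum(max(a, b) for a, b in zip(counts_a, counts_b))
    if denominator == 0:
        return 1.0  # same convention for two all-zero vectors
    numerator = sum(min(a, b) for a, b in zip(counts_a, counts_b))
    return numerator / denominator
```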

A radical alternative would be to maintain a separate database just as a vector store. It would allow us to do great things beyond the platform too...

https://github.com/milvus-io/milvus and https://github.com/weaviate/weaviate are the most popular ones to my knowledge.

Imagine storing count vectors and SciBERT vectors of everything we import and simply recalling the most similar documents. Found a relevant document? Here are the 10 most similar ones (instantly, no classifier needed).

After careful discussions, @maxcall and I decided against a vector store for now. It is a great idea but needs further investigation. For now, memory and speed issues are mostly resolved by logic added to the importer (checking known and trusted IDs before building an index).
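
The idea behind the importer change can be sketched as follows. This is not the code from the actual commit: the record shape, the `split_by_known_ids` helper, and the sample IDs are hypothetical. It only shows the principle of resolving records by ID first, so that the expensive index is built only when unresolved records remain.

```python
def split_by_known_ids(records, known_ids):
    """Partition incoming records into already-known ones and new candidates.

    Records without an ID cannot be trusted and are treated as candidates.
    """
    known, candidates = [], []
    for record in records:
        if record.get("id") in known_ids:
            known.append(record)
        else:
            candidates.append(record)
    return known, candidates


incoming = [
    {"id": "doc-1", "title": "Already imported"},
    {"id": "doc-2", "title": "New document"},
]
known, candidates = split_by_known_ids(incoming, {"doc-1", "doc-3"})

# Only build the (expensive) nearest-neighbour index if something is left
# that could not be resolved by its ID.
needs_index = bool(candidates)
```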

Restarts should also be fixed via 4550dd5c

Closing this for now; we will pick up vector stores in a separate issue later.

closed
