I have a p40 I'd be glad to run a benchmark on, just tell me how. I have Ooba and llama.cpp installed on linux Ubuntu 22.04, it's a Dell r620 with 2 x 12 3.5 Ghz cores (2 threads per core for 48 threads) Xeon with 256GB ram @ 1833Mhz, I have a pci-e gen 1 20 slot backplane. The speed of the pci-e bus might impact the loading time of the large models, but seems to not affect the speed of inference.
I went for the p40 for costs per GB of vram, speed was less important to me than being able to load the larger models at all. Including the fan and fan coupling i'm all in about $250 per card. I'm planning on adding more in the future, I to suffer from too many pci-e slots.
The cuda version I dont think will become an issue anytime to soon but is coming to be sure.