4-bit quant test
GPU | VRAM (GB) | SPEED (T/S) | PRICE ($/HR) | VALUE (T/$) |
---|---|---|---|---|
RTX A6000 | 48 | 52.77 | 0.79 | 240,508.60 |
RTX 6000 Ada | 48 | 68.05 | 1.14 | 214,894.17 |
A40 | 48 | 45.17 | 0.79 | 205,840.98 |
L40 | 48 | 56.87 | 1.14 | 179,614.82 |
A100 SXM | 80 | 61.96 | 2.29 | 97,413.77 |
RTX 4090 | 24 | 12.61 | 0.74 | 61,362.23 |
H100 SXM5 | 80 | 43.00 | 4.69 | 33,008.47 |
RTX A5000 | 24 | 1.59 | 0.44 | 13,084.85 |
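For reference, the value column is just the measured speed scaled to an hour of rented time and divided by the hourly price. A minimal sketch of that calculation (small differences from the table come from the speeds being rounded to two decimals):

```python
# Tokens per dollar from measured generation speed and hourly rental price.
def tokens_per_dollar(speed_tps: float, price_per_hr: float) -> float:
    """speed_tps: tokens/second; price_per_hr: rental cost in $/hr."""
    return speed_tps * 3600 / price_per_hr

# Example: RTX A6000 at 52.77 T/S and $0.79/hr.
print(f"{tokens_per_dollar(52.77, 0.79):,.2f}")
# ~240,470.89 T/$ (the table shows 240,508.60 because it used the unrounded speed)
```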
Re-test with lower quants for low-VRAM GPUs
GPU | VRAM (GB) | SPEED (T/S) | PRICE ($/HR) | VALUE (T/$) | QUANT (BITS) | MODEL |
---|---|---|---|---|---|---|
RTX 4000 Ada | 20 | 46.75 | 0.39 | 431,538.46 | 2 | mixtral:8x7b-instruct-v0.1-q2_K |
A5000 | 24 | 49.45 | 0.44 | 404,590.91 | 2 | mixtral:8x7b-instruct-v0.1-q2_K |
RTX 4090 | 24 | 75.88 | 0.74 | 369,145.95 | 2 | mixtral:8x7b-instruct-v0.1-q2_K |
RTX 4090 | 24 | 69.01 | 0.74 | 335,724.32 | 3 | mixtral:8x7b-instruct-v0.1-q3_K_S |
A5000 | 24 | 38.25 | 0.44 | 312,954.55 | 3 | mixtral:8x7b-instruct-v0.1-q3_K_S |
A6000 | 48 | 57.67 | 0.79 | 262,800.00 | 2 | mixtral:8x7b-instruct-v0.1-q2_K |
RTX 6000 Ada | 48 | 80.12 | 1.14 | 253,010.52 | 2 | mixtral:8x7b-instruct-v0.1-q2_K |
A40 | 48 | 51.05 | 0.79 | 232,632.91 | 2 | mixtral:8x7b-instruct-v0.1-q2_K |
RTX 6000 Ada | 48 | 69.83 | 1.14 | 220,515.78 | 3 | mixtral:8x7b-instruct-v0.1-q3_K_S |
A6000 | 48 | 45.60 | 0.79 | 207,797.47 | 3 | mixtral:8x7b-instruct-v0.1-q3_K_S |
L40 | 48 | 65.55 | 1.14 | 207,000.00 | 2 | mixtral:8x7b-instruct-v0.1-q2_K |
A40 | 48 | 41.21 | 0.79 | 187,792.41 | 3 | mixtral:8x7b-instruct-v0.1-q3_K_S |
L40 | 48 | 56.71 | 1.14 | 179,084.21 | 3 | mixtral:8x7b-instruct-v0.1-q3_K_S |
I don't have pricing for the following card(s):
GPU | VRAM (GB) | SPEED (T/S) | QUANT (BITS) | MODEL |
---|---|---|---|---|
TESLA P40 | 24 | 18.12 | 2 | mixtral:8x7b-instruct-v0.1-q2_K |
TESLA P40 | 24 | 8.43 | 4 | mixtral:8x7b |
Note: GPT-3.5 comes in at about 500,000 tokens/$.
This is a test of the mixtral:8x7b model using ollama. I know it's not the most scientific methodology (for example, I could have used a smaller quantization for the RTX A5000 in the first run and probably would have gotten much better results, since it only has 24 GB of VRAM and the 4-bit model doesn't fully fit on the card).
It's also possible that other components were the bottleneck, since each pod's configuration was slightly different.
Also note that some GPUs have spot (interruptible) options, so you can run them even cheaper: the A6000 is $0.49/hr on spot secure cloud and $0.34/hr on spot community cloud.
My workflow is right here in case you want to critique or try it out yourself.
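If you want to sanity-check the speed numbers without the linked workflow, Ollama's local REST API reports `eval_count` and `eval_duration` (in nanoseconds) with each generation, which is enough to compute T/S. A minimal sketch, assuming a local Ollama server with the model already pulled (the prompt is just a placeholder):

```python
import requests

# Ask a local Ollama server to generate once, then derive tokens/second
# from its reported eval stats (eval_duration is in nanoseconds).
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "mixtral:8x7b-instruct-v0.1-q2_K",
        "prompt": "Write a short story about a GPU benchmark.",
        "stream": False,
    },
).json()

speed_tps = resp["eval_count"] / (resp["eval_duration"] / 1e9)
print(f"{speed_tps:.2f} T/S")
```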
EDIT: Pricing and testing are from RunPod.
EDIT2:
Thank you u/ibbobud, u/ReadyAndSalted, and everyone else for your input.
I've re-run the tests with 2-bit and 3-bit quantization, and the results are interesting! I've also fixed the pricing, so T/$ should be good now.
Added the RTX 6000 Ada.