

Quick overview of price/performance for text generation on different GPUs

4 bit quant test

| GPU | VRAM (GB) | SPEED (T/S) | PRICE ($/HR) | VALUE (T/$) |
|---|---|---|---|---|
| RTX A6000 | 48 | 52.77 | 0.79 | 240,508.60 |
| RTX 6000 Ada | 48 | 68.05 | 1.14 | 214,894.17 |
| A40 | 48 | 45.17 | 0.79 | 205,840.98 |
| L40 | 48 | 56.87 | 1.14 | 179,614.82 |
| A100 SXM | 80 | 61.96 | 2.29 | 97,413.77 |
| RTX 4090 | 24 | 12.61 | 0.74 | 61,362.23 |
| H100 SXM5 | 80 | 43.00 | 4.69 | 33,008.47 |
| RTX A5000 | 24 | 1.59 | 0.44 | 13,084.85 |

Re-test with lower quants for low-VRAM GPUs

| GPU | VRAM (GB) | SPEED (T/S) | PRICE ($/HR) | VALUE (T/$) | QUANT | MODEL |
|---|---|---|---|---|---|---|
| RTX 4000 Ada | 20 | 46.75 | 0.39 | 431,538.46 | 2 | mixtral:8x7b-instruct-v0.1-q2_K |
| A5000 | 24 | 49.45 | 0.44 | 404,590.91 | 2 | mixtral:8x7b-instruct-v0.1-q2_K |
| RTX 4090 | 24 | 75.88 | 0.74 | 369,145.95 | 2 | mixtral:8x7b-instruct-v0.1-q2_K |
| RTX 4090 | 24 | 69.01 | 0.74 | 335,724.32 | 3 | mixtral:8x7b-instruct-v0.1-q3_K_S |
| A5000 | 24 | 38.25 | 0.44 | 312,954.55 | 3 | mixtral:8x7b-instruct-v0.1-q3_K_S |
| A6000 | 48 | 57.67 | 0.79 | 262,800.00 | 2 | mixtral:8x7b-instruct-v0.1-q2_K |
| RTX 6000 Ada | 48 | 80.12 | 1.14 | 253,010.52 | 2 | mixtral:8x7b-instruct-v0.1-q2_K |
| A40 | 48 | 51.05 | 0.79 | 232,632.91 | 2 | mixtral:8x7b-instruct-v0.1-q2_K |
| RTX 6000 Ada | 48 | 69.83 | 1.14 | 220,515.78 | 3 | mixtral:8x7b-instruct-v0.1-q3_K_S |
| A6000 | 48 | 45.60 | 0.79 | 207,797.47 | 3 | mixtral:8x7b-instruct-v0.1-q3_K_S |
| L40 | 48 | 65.55 | 1.14 | 207,000.00 | 2 | mixtral:8x7b-instruct-v0.1-q2_K |
| A40 | 48 | 41.21 | 0.79 | 187,792.41 | 3 | mixtral:8x7b-instruct-v0.1-q3_K_S |
| L40 | 48 | 56.71 | 1.14 | 179,084.21 | 3 | mixtral:8x7b-instruct-v0.1-q3_K_S |

I don't have pricing for the following card(s):

| GPU | VRAM (GB) | SPEED (T/S) | QUANT | MODEL |
|---|---|---|---|---|
| TESLA P40 | 24 | 18.12 | 2 | mixtral:8x7b-instruct-v0.1-q2_K |
| TESLA P40 | 24 | 8.43 | 4 | mixtral:8x7b |
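The VALUE column appears to be tokens/sec × 3600 ÷ hourly price (inferred from the tables, not stated explicitly; minor differences from the listed values come from rounding of the speed inputs). A minimal sketch:

```python
def tokens_per_dollar(speed_tps: float, price_per_hr: float) -> float:
    """Tokens generated per dollar: (tokens/sec) * (3600 sec/hr) / ($/hr)."""
    return speed_tps * 3600 / price_per_hr

# RTX A6000 row from the 4-bit table; inputs are rounded, so this lands
# near (not exactly on) the table's 240,508.60
print(round(tokens_per_dollar(52.77, 0.79), 2))
```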

Note: GPT-3.5 comes in at about 500,000 tokens/$.

This is a test of the mixtral:8x7b model using ollama. I know it's not the most scientific research method (I could have used a smaller quantization for the RTX A5000, for example, and probably would have had much better results since it only has 24GB of VRAM).
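For anyone reproducing this: ollama's `/api/generate` endpoint reports `eval_count` and `eval_duration` (in nanoseconds) in its response, which is one way to get T/S. A sketch assuming a default local install on port 11434 (the model tag here is illustrative):

```python
import json
import urllib.request

def generate(prompt: str, model: str = "mixtral:8x7b-instruct-v0.1-q2_K",
             url: str = "http://localhost:11434/api/generate") -> dict:
    """One non-streaming generation against a local ollama server."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(url, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

def eval_rate_tps(resp: dict) -> float:
    """Tokens/sec from the response's eval_count and eval_duration (ns)."""
    return resp["eval_count"] / (resp["eval_duration"] / 1e9)

# e.g. 512 tokens generated in 9.7 s is roughly the A6000's 52.77 T/S
example = {"eval_count": 512, "eval_duration": 9_700_000_000}
print(round(eval_rate_tps(example), 2))
```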

There is a possibility that other components caused the bottleneck since each pod was slightly different.

Also note some GPUs have spot (interruptible) options so you can run them even cheaper. The A6000 is $0.49/hr for spot on secure cloud and $0.34/hr for spot on community cloud.

My workflow is right here in case you want to critique or try it out yourself.

EDIT: Pricing and testing are from runpod.

EDIT2:

Thank you u/ibbobud and u/ReadyAndSalted and everyone else for your inputs.

I've re-run the tests with 2- and 3-bit quantization and the results are interesting! I've also fixed the pricing, so T/$ should be correct now.

Added RTX 6000 Ada
