This post has been de-listed

It is no longer included in search results and normal feeds (front page, hot posts, subreddit posts, etc.). It remains visible only via the author's post history.

Score: 106
[D] Would a Tesla M40 provide cheap inference acceleration for self-hosted LLMs?
Author Summary
roz303
Post Body

I'm extremely interested in running a self-hosted version of Vicuna-13b. So far, I've been able to get it to run at a very reasonable level of performance in the cloud on a Tesla T4 and a V100 by using 4-bit and 8-bit quantization. I'd love to bring it home and build a private server. However, those cards are mind-numbingly expensive. Although the 3090 has come down in price lately, $700 is still pretty steep. I was doing some research, and it seems that a CUDA compute capability of 5 or higher is the minimum required. At around $70 on eBay (roughly $100 after a blower shroud; I'm aware these are datacenter cards), the Tesla M40 meets that requirement at CC 5.2 and has 24 GB of VRAM. In theory it sounds like it'd be enough, right? Obviously I'm not going to be training or fine-tuning LLMs with the card, but it sounds like it'd be enough for performing inference on the cheap and generating output at four or five tokens per second. What do you all think? Worth investing a few hundred dollars in building a little M40 rig, or would it still be too slow to be worth the trouble?
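
For anyone who wants to sanity-check a card like this before buying, here's the kind of minimal sketch I have in mind. It only assumes a CUDA build of PyTorch and a single visible GPU, and the 4-bit footprint figure is back-of-the-envelope arithmetic rather than a measurement:

```python
import torch

# Quick sanity check, assuming a CUDA build of PyTorch and one visible GPU.
props = torch.cuda.get_device_properties(0)
major, minor = torch.cuda.get_device_capability(0)
print(f"{props.name}: compute capability {major}.{minor}, "
      f"{props.total_memory / 1024**3:.1f} GiB VRAM")

# Back-of-the-envelope: 13B parameters at 4 bits per weight is ~6.5e9 bytes.
# KV cache and activations come on top, but a 24 GB card has headroom on paper.
weights_gib = 13e9 * 0.5 / 1024**3
print(f"Approx. 4-bit weight footprint for a 13B model: {weights_gib:.1f} GiB")
```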

Comments
[not loaded or deleted]

I don't want to be chained to the cloud, though. The whole point of a rig like this is a private, personal, self-hosted LLM. I don't want big corps to have access to it.

[not loaded or deleted]

Oh yeah, I totally get the age being a major factor. The overall goal here is just to have a sub-$500 rig that doesn't take fifteen minutes or more to finish a prompt.

[not loaded or deleted]

Could you possibly do me a favor and try running Vicuna-13b and telling me how many tokens per second you're able to get? This sounds pretty interesting.
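
If it helps, the sketch below is roughly how I'd measure it. The model id, the load_in_8bit flag, and whether bitsandbytes even runs on an older Maxwell card are all assumptions on my part, not things I've verified:

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed checkpoint; swap in whichever Vicuna-13b weights you actually use.
model_id = "lmsys/vicuna-13b-v1.5"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    load_in_8bit=True,  # needs bitsandbytes; Maxwell-era support is not a given
)

prompt = "Explain what CUDA compute capability means in one short paragraph."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

start = time.time()
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=128)
elapsed = time.time() - start

# Count only the newly generated tokens, not the prompt.
new_tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens} tokens in {elapsed:.1f}s -> {new_tokens / elapsed:.2f} tok/s")
```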

[not loaded or deleted]

I mean, my overall goal is three to five tokens per second; whether or not this requires a GPU is irrelevant. I really appreciate this! I'll take a look :)

[not loaded or deleted]

What do you think would be the most cost-effective solution?

[not loaded or deleted]

Seems the consensus is to experiment first before buying the hardware! Thankya w^

Author
Account Strength
100%
Account Age
9 years
Verified Email
Yes
Verified Flair
No
Total Karma
32,969
Link Karma
23,523
Comment Karma
9,115
Profile updated: 1 week ago
Posts updated: 2 months ago

Post Details

Posted
1 year ago