This Research Paper claims to match 4 H100s with 50 RTX 3080s, what do you think?
Post Body

FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs

The rapid growth of the memory and computation requirements of large language models (LLMs) has outpaced the development of hardware, hindering people who lack large-scale high-end GPUs from training or deploying LLMs. However, consumer-level GPUs, which constitute a larger market share, are typically overlooked for LLMs due to their weaker computing performance, smaller storage capacity, and lower communication bandwidth. Additionally, users may have privacy concerns when interacting with remote LLMs. In this paper, we envision a decentralized system that unlocks the potential of the vast untapped consumer-level GPUs for pre-training, inference, and fine-tuning of LLMs with privacy protection. However, this system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, and the variability of peers and heterogeneity of devices. To address these challenges, our system design incorporates: 1) a broker with a backup pool to support the dynamic joining and quitting of computing providers; 2) task scheduling informed by hardware performance to improve system efficiency; 3) abstracting ML procedures into directed acyclic graphs (DAGs) to achieve model and task universality; 4) abstracting intermediate representation and execution planes to ensure compatibility across devices and deep learning (DL) frameworks. Our performance analysis demonstrates that 50 RTX 3080 GPUs can achieve throughput comparable to that of 4 H100 GPUs, which are significantly more expensive.
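
To make design points 1) and 2) concrete, here is a minimal sketch of how a broker with a backup pool and performance-aware scheduling could interact. Everything below (`Broker`, `CompNode`, the method names) is an illustrative assumption of mine, not the paper's actual code or API:

```python
# Illustrative sketch only -- not FusionAI's implementation.
# A broker tracks active and backup compute providers ("compnodes")
# and assigns the stages of a task DAG to them.
from dataclasses import dataclass
from collections import deque

@dataclass
class CompNode:
    node_id: str
    tflops: float      # advertised/measured hardware performance
    online: bool = True

class Broker:
    def __init__(self) -> None:
        self.active: dict[str, CompNode] = {}
        self.backup: deque[CompNode] = deque()

    def join(self, node: CompNode) -> None:
        # New providers land in the backup pool (design point 1).
        self.backup.append(node)

    def schedule(self, dag_stages: list[str]) -> dict[str, CompNode]:
        # Promote the fastest backups until every DAG stage has a node --
        # a crude stand-in for performance-aware scheduling (point 2).
        ranked = sorted(self.backup, key=lambda n: n.tflops, reverse=True)
        assignment: dict[str, CompNode] = {}
        for stage, node in zip(dag_stages, ranked):
            self.backup.remove(node)
            self.active[node.node_id] = node
            assignment[stage] = node
        return assignment

    def handle_quit(self, node_id: str) -> CompNode | None:
        # A provider that quits is replaced from the backup pool, if any.
        self.active.pop(node_id, None)
        return self.backup.popleft() if self.backup else None
```

The real system would also need the IR/execution-plane layer from point 4) so that a stage can actually run on whichever device and DL framework it lands on; that part is out of scope for a sketch like this.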

Limitations

Fault tolerance. In our current system design, fault tolerance is handled in a relatively simple manner (Section 3.2). The disconnection of compnodes interrupts DAG execution, and replacing disconnected peers with randomly sampled online peers can disrupt the load balance of scheduled tasks. Therefore, the costs of recovery, restart, and rescheduling need to be considered. Efficient fault-tolerance schemes, including elastic training and swift, distributed checkpointing, will be explored in future work to improve the system's fault-tolerance capabilities.
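
To see why recovery cost matters here, a minimal checkpoint-and-resume loop might look like the sketch below: a replacement compnode reloads the last checkpoint instead of forcing a full DAG restart. The helpers, the pickle-based checkpoint, and the `state` dict are all assumptions for illustration; the paper explicitly defers its fault-tolerance scheme to future work.

```python
# Hedged sketch: bound the cost of a compnode disconnection by resuming
# a DAG stage from its last checkpoint on a replacement node.
import pickle

def run_stage(stage_fn, state, checkpoint_path, steps, ckpt_every=100):
    # stage_fn takes and returns the stage's state dict.
    for step in range(state.get("step", 0), steps):
        state = stage_fn(state)
        state["step"] = step + 1
        if (step + 1) % ckpt_every == 0:
            with open(checkpoint_path, "wb") as f:
                pickle.dump(state, f)  # "swift" checkpointing would overlap this I/O
    return state

def resume_on_replacement(stage_fn, checkpoint_path, steps):
    # A freshly promoted backup node reloads the last checkpoint, so a
    # disconnection costs at most ckpt_every steps of redone work.
    with open(checkpoint_path, "rb") as f:
        state = pickle.load(f)
    return run_stage(state_fn := stage_fn, state, checkpoint_path, steps)
```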

Pipeline optimization. To enhance system efficiency, our load-balance scheduling (Section 3.8) provides an initial step toward reducing the bubble time of pipeline parallelism. However, how to execute pipelines efficiently across compnodes remains an open question.
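
For background on "bubble time": in a synchronous GPipe-style pipeline with p stages and m microbatches, devices sit idle for roughly (p - 1) / (m + p - 1) of each step. This is a standard textbook formula, not a number from the paper, but it shows why spreading a model across ~50 consumer GPUs is so much harder to keep busy than 4 H100s:

```python
# Standard GPipe-style pipeline bubble fraction (background, not from the paper):
# with p pipeline stages and m microbatches per minibatch, each device is
# idle for a fraction (p - 1) / (m + p - 1) of every training step.
def bubble_fraction(p: int, m: int) -> float:
    return (p - 1) / (m + p - 1)

# More stages means more bubble unless the microbatch count grows with it:
print(bubble_fraction(p=4,  m=32))   # 4 H100-like stages  -> ~0.086
print(bubble_fraction(p=50, m=32))   # 50 3080-like stages -> ~0.605
```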

Could we possibly compete with the big companies by using decentralized training to train LLMs? Democratic Machine Learning?

Author

Account Strength: 70%
Account Age: 2 years
Verified Email: Yes
Verified Flair: No
Total Karma: 2,866
Link Karma: 1,216
Comment Karma: 1,650
Profile updated: 17 hours ago
Posts updated: 4 days ago

Subreddit

Post Details

Posted: 11 months ago