First, I would like to extend a very heartfelt thank you to /u/PineappleFund for their generous contribution to Pushshift.io. Their contribution has enabled us to increase our services to the academic community and has cemented the long-term viability of the project.
Let's talk about Project Discovery!
With that said, I want to talk about a new project that will help take research to the next level, and to get the community's opinions, suggestions and support for it. While PineappleFund's contribution was extremely generous, turning this project into a reality will require additional resources.
What is this all about?
A lot of academic institutions and data scientists use my monthly Reddit dumps to do amazing research. If you check out Google Scholar, you will see that two dozen papers have been published using this valuable data source.
There has been some interesting machine-learning research on this data. Unfortunately, the data is huge, and not everyone can afford the resources needed to do amazing research with it. This is something I'd like to change with your help.
The new project is tentatively called Project Discovery and it involves bringing together the best data scientists, data visualization experts and machine learning / NLP researchers on Reddit and within the academic community at large.
My goal is to create a time-share server that lets many researchers and data scientists analyze massive amounts of data on powerful shared hardware. The funding needed to see this through will be in the range of $30,000-$40,000, but PineappleFund's contribution is enough to get the framework together and create an expandable server that can grow over time.
As many of you know, Nvidia recently released their Titan V GPU, which includes dedicated Tensor Cores. This card is capable of up to 110 teraflops of deep-learning performance. I want to include one in Project Discovery so that machine-learning experts can use this power to help analyze big data.
I am putting together an extremely powerful development server that will be shared with researchers. It will include the entire Reddit data corpus and also constantly update with new Reddit data from my ingest.
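The monthly dumps are compressed ndjson (one JSON object per line), so they can be streamed without ever loading a full month into memory. Here is a minimal Python sketch of that pattern; the filename and the per-subreddit tally are purely illustrative:

```python
import bz2
import json
from collections import Counter

def count_by_subreddit(lines):
    """Tally object counts per subreddit from an iterable of ndjson lines."""
    counts = Counter()
    for line in lines:
        obj = json.loads(line)
        counts[obj.get("subreddit", "unknown")] += 1
    return counts

def stream_dump(path):
    """Lazily yield decoded lines from a bz2-compressed monthly dump."""
    with bz2.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            yield line

# Example (hypothetical filename):
# counts = count_by_subreddit(stream_dump("RC_2017-11.bz2"))
```

Because `stream_dump` is a generator, memory use stays flat regardless of dump size, which matters when a single month of comments is tens of gigabytes compressed.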
The end goal specs for this server include:
- Dual-socket motherboard supporting two AMD EPYC CPUs
- (2) AMD EPYC 7601, each with 32 cores / 64 threads, for a system total of 64 cores and 128 threads
- Samsung ECC memory (32 GB or 64 GB DIMMs). It would be awesome to go with 64 GB DIMMs for a total system memory of 2 TB, but my goal is at least 1 TB, which we can reach with 32 GB DIMMs in all 32 memory slots.
- 6 TB of NVMe storage using Samsung's 2 TB NVMe drives
- 12 TB of SSD storage using (3) of Samsung's 4 TB SSDs
- 48 TB of platter storage for backup purposes
- (1) Nvidia Titan V GPU for TensorFlow / machine-learning projects
I want to get feedback from the community on all of this, especially the OS that will be used for the main server. I will have (2) smaller development servers with ~128 GB of memory each to test code before running it on the larger server. Each of the smaller development servers will also hold the complete Reddit corpus.
Operating System and Sysadmin support needed
The OS I am leaning towards is Ubuntu 16.04 LTS (at least until Ubuntu 18.04 LTS becomes available). All of the servers will have a 300 Mbps fiber connection to the internet.
I will also need assistance from sysadmins with experience setting up multi-user environments, to make sure we have proper security in place for Project Discovery.
As this project progresses into 2018, I will have a sign-up form for academic students and researchers so that they can get set up with an account and get access to all the various servers. We will have to come up with a plan for time-share when using the Titan V (once we are able to get that).
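As a starting point for that time-share discussion, here is a toy round-robin sketch in Python. The class and user names are hypothetical; it only illustrates one possible fair-share policy, not a finished scheduler:

```python
from collections import deque

class GpuTimeShare:
    """Toy round-robin scheduler: each call hands the next GPU slot
    to the user at the front of the queue, then requeues them."""

    def __init__(self, users):
        self.queue = deque(users)

    def next_user(self):
        user = self.queue.popleft()
        self.queue.append(user)  # back of the line for the next round
        return user

# Example: three researchers sharing one Titan V
sched = GpuTimeShare(["alice", "bob", "carol"])
```

A real policy would likely need priorities, slot lengths, and preemption, but even a simple rotation like this guarantees every account eventually gets the card.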
I may do a Kickstarter (a $50K Kickstarter would be enough to see this come alive in 2018) to raise the rest of the needed funding. I will work with some of the other /r/datasets mods who have experience with machine learning to figure out the funds needed to make this amazing project come to life. Memory will cost around $10k for 32 of the 32 GB DIMMs. The two EPYC server CPUs will be ~$10k for the pair. Storage will be around $5-$10k total, and the Titan V card is around $3k. It will probably be beneficial to reach out to AMD to see if they would like to contribute to this project (along with Nvidia and Samsung).
The server can also be scaled up over time (start with one CPU, then add the second).
If this happens, it will be a huge benefit to the academic community at large. I know a lot of extremely smart data scientists who could do amazing things with these resources, so I will do my best to make this a reality for 2018.
Again, if you have any thoughts or ideas, please let me know!
Pre-installed Software
I know that software developers and researchers have different preferences on what type of database systems they would like to use. To that end, I would like to support as many open-source products as possible so that everyone has the tools they need for their project. Here is a short list of open-source products that will be available:
- PostgreSQL
- MySQL / MariaDB
- MongoDB
- CouchDB
- Hadoop
- Apache Derby
- Redis
- Elasticsearch (6.x)
- Apache Lucene
- Python 3 with NumPy, SciPy, etc. If you need a module, we'll install it.
- R
- Python 2
- Perl
- Node.js
- Java
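For example, with the corpus indexed in Elasticsearch 6.x, a researcher could pull all comments from one subreddit over a time window. A minimal Python sketch of building such a query body follows; the field names `subreddit` and `created_utc` match the Reddit dump schema, but the function itself is illustrative:

```python
def build_comment_query(subreddit, after_epoch, before_epoch, size=100):
    """Build an Elasticsearch 6.x query body for comments in one
    subreddit within a half-open epoch-seconds window."""
    return {
        "size": size,
        "query": {
            "bool": {
                "filter": [
                    {"term": {"subreddit": subreddit}},
                    {"range": {"created_utc": {"gte": after_epoch,
                                               "lt": before_epoch}}},
                ]
            }
        },
        "sort": [{"created_utc": "asc"}],
    }

# Example: comments in /r/datasets during December 2017 (UTC)
query = build_comment_query("datasets", 1512086400, 1514764800)
```

Using a `bool` `filter` (rather than a scoring `query`) keeps this cacheable and fast, which matters when many accounts hit the same index.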
Here is a recap of the end goal for the system:
Attribute | Value |
---|---|
Min System Memory | 1,024 GB |
# of Physical Cores | 64 |
# of Threads | 128 |
Peak CPU Performance | 2,480 GigaFLOPS |
NVMe Storage | 6 TB |
SSD Storage | 12 TB |
Platter Backup Storage | 48 TB |
GPU Tensor Cores | 640 |
GPU CUDA Cores | 5,120 |
GPU Clock | 1,455 MHz |
GPU Frame Buffer | 12 GB HBM2 |
GPU Performance | 110 TeraFLOPS |
Auxiliary Development Servers (2)
Attribute | Value |
---|---|
CPU Type | Xeon D-1528 |
CPU Cores | 6 |
CPU Threads | 12 |
System Memory | 128 GB |
NVMe Storage | 1 TB |
Happy Holidays!