First, I would like to extend a very heartfelt thank you to /u/PineappleFund for their generous contribution to Pushshift.io. Their contribution has enabled us to increase our services to the academic community and has cemented the long-term viability of the project.
Let's talk about Project Discovery!
With that said, I want to talk about a new project that will help take research to the next level, and to get the community's opinions, suggestions and support for it. While PineappleFund's contribution was extremely generous, turning this project into a reality will require additional resources.
What is this all about?
A lot of academic institutions and data scientists use my monthly Reddit dumps to do amazing research. If you check out Google Scholar, you will see that two dozen papers have been published using this valuable data source.
There has been some interesting machine-learning research on this data. Unfortunately, the data is huge, and not everyone can afford the resources needed to do amazing research with it. This is something I'd like to change with your help.
The new project is tentatively called Project Discovery and it involves bringing together the best data scientists, data visualization experts and machine learning / NLP researchers on Reddit and within the academic community at large.
My goal is to create a time-share server that lets many researchers and data scientists analyze massive amounts of data on powerful shared hardware. The funding needed to see this through will be in the range of $30,000-$40,000, but PineappleFund's contribution is enough to get the framework together and create an expandable server that can grow over time.
As many of you know, Nvidia recently released their Titan V GPU, which includes dedicated Tensor Cores. This card is capable of up to 110 teraflops of deep-learning performance. I want to include one in Project Discovery so that machine-learning experts can use this power to help analyze big data.
I am putting together an extremely powerful development server that will be shared with researchers. It will include the entire Reddit data corpus and also constantly update with new Reddit data from my ingest.
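The monthly dumps are compressed ndjson (one JSON object per line), so they can be streamed without ever loading a full month into memory. Here is a minimal Python sketch of that pattern; the filename and the per-subreddit tally are purely illustrative:

```python
import bz2
import json
from collections import Counter

def count_by_subreddit(lines):
    """Tally object counts per subreddit from an iterable of ndjson lines."""
    counts = Counter()
    for line in lines:
        obj = json.loads(line)
        counts[obj.get("subreddit", "unknown")] += 1
    return counts

def stream_dump(path):
    """Lazily yield decoded lines from a bz2-compressed monthly dump."""
    with bz2.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            yield line

# Example (hypothetical filename):
# counts = count_by_subreddit(stream_dump("RC_2017-11.bz2"))
```

Because `stream_dump` is a generator, memory use stays flat regardless of dump size, which matters when a single month of comments is tens of gigabytes compressed.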
The end goal specs for this server include:
- Dual-socket motherboard supporting two AMD EPYC CPUs
- (2) AMD EPYC 7601, each with 32 cores / 64 threads, for a system total of 64 cores and 128 threads
- Samsung ECC memory (32 GB or 64 GB DIMMs). It would be awesome to go with 64 GB DIMMs for a total system memory of 2 TB, but my goal is at least 1 TB, which we can reach with 32 GB DIMMs in all 32 memory slots.
- 6 TB of NVMe storage using Samsung's 2 TB NVMe drives
- 12 TB of SSD storage using (3) of Samsung's 4 TB SSDs
- 48 TB of platter storage for backup purposes
- (1) Nvidia Titan V GPU for TensorFlow / machine-learning projects
I want to get feedback from the community on all of this, especially the OS that will be used for the main server. I will have (2) smaller development servers with ~128 GB of memory each to test code before running it on the larger server. Each of the smaller development servers will also hold the complete Reddit corpus.
Operating System and Sysadmin support needed
The OS I am leaning towards is Ubuntu 16.04 LTS (at least until Ubuntu 18.04 LTS becomes available). All of the servers will have a 300 Mbps fiber connection to the internet.
I will also need assistance from sysadmins with experience setting up multi-user environments, to make sure we have proper security in place for Project Discovery.
As this project progresses into 2018, I will have a sign-up form for academic students and researchers so that they can get set up with an account and get access to all the various servers. We will have to come up with a plan for time-share when using the Titan V (once we are able to get that).
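As a starting point for that time-share discussion, here is a toy round-robin sketch in Python. The class and user names are hypothetical; it only illustrates one possible fair-share policy, not a finished scheduler:

```python
from collections import deque

class GpuTimeShare:
    """Toy round-robin scheduler: each call hands the next GPU slot
    to the user at the front of the queue, then requeues them."""

    def __init__(self, users):
        self.queue = deque(users)

    def next_user(self):
        user = self.queue.popleft()
        self.queue.append(user)  # back of the line for the next round
        return user

# Example: three researchers sharing one Titan V
sched = GpuTimeShare(["alice", "bob", "carol"])
```

A real policy would likely need priorities, slot lengths, and preemption, but even a simple rotation like this guarantees every account eventually gets the card.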
I may do a Kickstarter (a $50K Kickstarter would be enough to see this come alive in 2018) to raise the rest of the needed funding. I will work with some of the other /r/datasets mods who have experience with machine learning to figure out the funds needed to make this amazing project come to life. Memory will cost around $10k for 32 of the 32 GB DIMMs. The two EPYC server CPUs will be ~$10k for the pair. Storage will be around $5-$10k total, and the Titan V card is around $3k. It will probably be beneficial to reach out to AMD to see if they would like to contribute to this project (along with Nvidia and Samsung).
The server can also be scaled up over time (start with one CPU, then add the second).
If this happens, it will be a huge benefit to the academic community at large. I know a lot of extremely smart data scientists who could do amazing things with these resources, so I will do my best to make this a reality for 2018.
Again, if you have any thoughts or ideas, please let me know!
Pre-installed Software
I know that software developers and researchers have different preferences on what type of database systems they would like to use. To that end, I would like to support as many open-source products as possible so that everyone has the tools they need for their project. Here is a short list of open-source products that will be available:
- PostgreSQL
- MySQL / MariaDB
- MongoDB
- CouchDB
- Hadoop
- Apache Derby
- Redis
- Elasticsearch (6.x)
- Apache Lucene
- Python 3 with NumPy, SciPy, etc. If you need a module, we'll install it.
- R
- Python 2
- Perl
- Node.js
- Java
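For example, with the corpus indexed in Elasticsearch 6.x, a researcher could pull all comments from one subreddit over a time window. A minimal Python sketch of building such a query body follows; the field names `subreddit` and `created_utc` match the Reddit dump schema, but the function itself is illustrative:

```python
def build_comment_query(subreddit, after_epoch, before_epoch, size=100):
    """Build an Elasticsearch 6.x query body for comments in one
    subreddit within a half-open epoch-seconds window."""
    return {
        "size": size,
        "query": {
            "bool": {
                "filter": [
                    {"term": {"subreddit": subreddit}},
                    {"range": {"created_utc": {"gte": after_epoch,
                                               "lt": before_epoch}}},
                ]
            }
        },
        "sort": [{"created_utc": "asc"}],
    }

# Example: comments in /r/datasets during December 2017 (UTC)
query = build_comment_query("datasets", 1512086400, 1514764800)
```

Using a `bool` `filter` (rather than a scoring `query`) keeps this cacheable and fast, which matters when many accounts hit the same index.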
Here is a recap of the end goal for the system:
Attribute | Value |
---|---|
Min System Memory | 1,024 GB |
# of Physical Cores | 64 |
# of Threads | 128 |
Peak CPU Performance | 2,480 GigaFLOPS |
NVMe Storage | 6 TB |
SSD Storage | 12 TB |
Platter Backup Storage | 48 TB |
GPU Tensor Cores | 640 |
GPU CUDA Cores | 5,120 |
GPU Clock | 1,455 MHz |
GPU Frame Buffer | 12 GB HBM2 |
GPU Performance | 110 TeraFLOPS |
Auxiliary Development Servers (2)
Attribute | Value |
---|---|
CPU Type | Xeon D-1528 |
CPU Cores | 6 |
CPU Threads | 12 |
System Memory | 128 GB |
NVMe Storage | 1 TB |
Happy Holidays!