What does this mean for the project? Well, for one, I'll be able to add another 4-5 nodes to the ES cluster (each with 128 GB of RAM and an Intel Optane 905P drive). I'm going to get one to test first, but the Optane 905P should substantially increase the speed of random record retrievals -- on the order of 2-5x faster than the current setup using NVMe drives.
Intel Optanes are expensive, but their 10 microsecond latency makes low queue depth requests extraordinarily fast. More importantly, the additional servers will allow for replica shards that can serve as failovers if a server drops out of the cluster.
Currently, a server drops out of the cluster only on rare occasions, but when it does, it causes issues because it takes its data with it and records go missing. With replica shards, a server failure won't really affect the API except for possibly increasing the latency a bit for certain requests.
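To illustrate why replicas matter here, below is a toy sketch in Python. It is not Elasticsearch's actual shard allocator; the round-robin placement and the function names `place_shards` and `lost_shards` are made up for illustration. In a real cluster, the replica count is controlled by the `index.number_of_replicas` index setting.

```python
def place_shards(num_shards, num_nodes, replicas):
    """Toy placement: copy r of shard s lives on node (s + r) % num_nodes.

    This is a simplified stand-in for a real allocator, used only to show
    that copies of the same shard end up on different nodes.
    """
    if replicas >= num_nodes:
        raise ValueError("need more nodes than replicas so copies land on distinct nodes")
    return {
        shard: {(shard + copy) % num_nodes for copy in range(replicas + 1)}
        for shard in range(num_shards)
    }


def lost_shards(placement, failed_node):
    """Return the shards that have no surviving copy once failed_node is gone."""
    return sorted(
        shard
        for shard, nodes in placement.items()
        if not (nodes - {failed_node})
    )


if __name__ == "__main__":
    # Hypothetical cluster: 10 shards spread over 5 nodes, node 2 drops out.
    without_replicas = place_shards(num_shards=10, num_nodes=5, replicas=0)
    with_replicas = place_shards(num_shards=10, num_nodes=5, replicas=1)
    print("lost without replicas:", lost_shards(without_replicas, failed_node=2))
    print("lost with one replica:", lost_shards(with_replicas, failed_node=2))
```

With no replicas, every shard on the failed node disappears from query results. With one replica per shard, a single node failure leaves at least one copy of every shard available.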
The new servers will bring a substantial increase in processing power and storage speed. This is the motherboard for the new nodes. It will increase the available threads from the current 12 to 16 -- plus this CPU is faster in general than the Xeon D-1541s currently being used.
Each node will be populated with 128 GB of RAM (Samsung 32 GB DIMMs have fallen to around $200 each). Each node will also contain an Intel Optane 905P drive. While expensive, these drives will reduce the latency for random record retrievals within the cluster to less than half of what it is now. It should be possible to do aggregations on ALL Reddit comments and submissions in under ten seconds (currently these aggregations will usually time out).
I wanted to share the news since it is exciting, and I'd like to get feedback / suggestions from other techies to get a feel for the consensus on these new hardware decisions.
With these new additions, I'm also going to be looking at purchasing a 4U server with two EPYC CPUs and one or two GPUs. The total RAM for that system will be either 256 GB or 512 GB and it will power new endpoints that use machine learning / deep learning routines. My eventual goal is to put up some new endpoints to do image / meme detection, language detection (categorizing comments by language) and to translate comments to English so that searches for specific things can also match up against their equivalent in other languages. There is a project called Doppler (backed by MIT's Media Cloud and Harvard's Shorenstein Center) that works with image data as well and will do some exciting stuff. From my current knowledge, this project is completely open source but still under active development.
These new additions will increase the total RAM across all Pushshift servers to over 1.5 terabytes and increase fast storage from the current 8 TB to around 16 TB. This will also allow Pushshift to add other data (weather, real-time seismograph data, geospatial data, GOES satellite data, real-time solar flux data, etc.). I would eventually like to get to the point where Pushshift could be a one-stop shop for an array of real-time scientific data. Imagine being able to get seismograph data for a recent earthquake while simultaneously being able to query multiple social media platforms to see how it propagated based on comments from sources like Reddit, Twitter, etc. Also, with eventual webhook capability, people could subscribe to get alerts, within seconds, based on specific real-time data that reaches some pre-determined threshold (e.g. setting up a webhook to get alerts for earthquakes over a specific magnitude within a 250 mile radius of specific coordinates).
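As a rough idea of what the earthquake alert check could look like, here is a minimal Python sketch. None of this is an existing Pushshift API: the `Subscription` and `Quake` classes and the `should_alert` function are hypothetical names, and the distance uses the standard haversine formula with a mean Earth radius of 3958.8 miles.

```python
import math
from dataclasses import dataclass

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius, good enough for alert radii


@dataclass
class Subscription:
    """Hypothetical webhook subscription: center point, radius, magnitude floor."""
    lat: float
    lon: float
    radius_miles: float
    min_magnitude: float


@dataclass
class Quake:
    """Hypothetical incoming seismic event."""
    lat: float
    lon: float
    magnitude: float


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def should_alert(sub, quake):
    """True if the quake meets the magnitude floor and falls inside the radius."""
    if quake.magnitude < sub.min_magnitude:
        return False
    distance = haversine_miles(sub.lat, sub.lon, quake.lat, quake.lon)
    return distance <= sub.radius_miles


if __name__ == "__main__":
    # Example subscriber near Los Angeles, 250 mile radius, magnitude 4.5 and up.
    sub = Subscription(lat=34.05, lon=-118.25, radius_miles=250.0, min_magnitude=4.5)
    near_san_diego = Quake(lat=32.72, lon=-117.16, magnitude=5.0)
    near_san_francisco = Quake(lat=37.77, lon=-122.42, magnitude=5.0)
    print(should_alert(sub, near_san_diego))
    print(should_alert(sub, near_san_francisco))
```

A real service would run a check like this for each incoming event against each subscription and then POST to the subscriber's webhook URL when it returns true.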
Thank you! As always, I want to note that this project would not have been possible without the many donations and other contributions from all of you amazing people!
Posted 5 years ago. External URL: reddit.com/r/pushshift/c...