I'm sorry I haven't been around much the past week or so -- I've been extremely busy managing the servers and storage and keeping things running as smoothly as possible.
I wanted to take a few minutes to quickly update all of you on what's going on.
First, the January Reddit data took a little longer than normal to get because of storage issues on my end -- in total, the data I am ingesting now uses around one terabyte of storage every month or two. January comments are almost complete. After the initial ingest, I run a quick scan to check for missing ids so the data is as complete as possible; that normally takes one or two days at most. I should have the January comment data available by mid-week, and submissions will follow shortly thereafter.
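For anyone curious what a missing-id scan can look like, here is a minimal sketch. It is not my actual ingest code. It assumes the ids are base-36 strings that increase sequentially, and the function names are made up for illustration. Keep in mind that a gap in the sequence does not necessarily mean a comment exists for that id; it only tells you which ids are worth re-requesting.

```python
DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"


def int_to_base36(n):
    """Encode a non-negative integer as a lowercase base-36 string."""
    if n == 0:
        return "0"
    out = []
    while n > 0:
        n, rem = divmod(n, 36)
        out.append(DIGITS[rem])
    return "".join(reversed(out))


def find_missing_ids(ingested_ids):
    """Return the ids absent between the smallest and largest ingested id.

    Assumption: ids are base-36 strings drawn from one sequential counter.
    """
    seen = {int(i, 36) for i in ingested_ids}
    if not seen:
        return []
    lo, hi = min(seen), max(seen)
    return [int_to_base36(n) for n in range(lo, hi + 1) if n not in seen]


if __name__ == "__main__":
    # Hypothetical ids, only to show the shape of the result.
    print(find_missing_ids(["dt1a0", "dt1a1", "dt1a4"]))
```

For a full month of comments the set of ids would not fit comfortably in memory, so a real scan would more likely walk the id range in chunks.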
The code to handle updating scores is complete -- what is holding up deployment is, again, storage. I need to double check how Elasticsearch handles deleted documents (generally a merge operation has to be run manually to clean deleted documents out of the indexes). Most data in Elasticsearch is immutable, so when a document is updated, ES actually marks the old document as deleted and adds a new document to the index, even if only one field changed. I have to make sure that updating score data on a mass scale is doable given my current storage capacity. It should be, but I have to be extra careful not to run out of space, or else new document creation would fail -- a rather bad thing.
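To make the storage concern concrete, here is a toy model of that "mark deleted, then append" behavior along with a rough headroom check. This is a sketch only: `ToyIndex` and `fits_in_free_space` are hypothetical names, nothing here calls the Elasticsearch API, and the 20% safety margin is an arbitrary assumption.

```python
class ToyIndex:
    """Append-only store: an update never modifies a stored document in place."""

    def __init__(self):
        self.docs = []   # each entry: [doc_id, source, deleted_flag]
        self.live = {}   # doc_id -> position of the live version in self.docs

    def index(self, doc_id, source):
        if doc_id in self.live:
            # Mark the previous version as deleted; it still takes up space.
            self.docs[self.live[doc_id]][2] = True
        self.docs.append([doc_id, dict(source), False])
        self.live[doc_id] = len(self.docs) - 1

    def update(self, doc_id, **fields):
        # Even a one-field change rewrites the whole document.
        old = self.docs[self.live[doc_id]][1]
        self.index(doc_id, {**old, **fields})

    def stored_count(self):
        return len(self.docs)

    def live_count(self):
        return len(self.live)

    def merge(self):
        # Drop deleted versions, which is when the space is actually reclaimed.
        self.docs = [d for d in self.docs if not d[2]]
        self.live = {d[0]: i for i, d in enumerate(self.docs)}


def fits_in_free_space(docs_to_update, avg_doc_bytes, free_bytes, safety_margin=0.2):
    """Worst case: every updated document exists twice until a merge runs."""
    worst_case_extra = docs_to_update * avg_doc_bytes
    return worst_case_extra <= free_bytes * (1 - safety_margin)


if __name__ == "__main__":
    idx = ToyIndex()
    idx.index("a", {"score": 1, "body": "hi"})
    idx.update("a", score=5)
    print(idx.stored_count(), idx.live_count())  # two stored versions, one live
    idx.merge()
    print(idx.stored_count(), idx.live_count())  # one stored version, one live
```

The takeaway is that a mass score update can temporarily need about as much extra space as the documents being updated, so updating in batches and merging in between keeps the peak lower.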
On another front, Python has served me well in the past, but I'm now ingesting A LOT of data (not just Reddit, but Gab, Twitter, etc.). I've recently started learning Golang (it's a beautiful language in my opinion). Going from an interpreted, object-oriented, dynamically typed language (Python) to a compiled, statically typed language that does not have objects in the traditional sense (it has structs, methods, composition, etc., but in many ways it is very different from Python) is a lot of fun, but it is also a different world from a programming standpoint.
One of the major draws of Golang is that it is generally anywhere from 10-50 times faster than Python for most things, and sometimes 75-100x faster for others. Golang is comparable to Java in terms of raw speed, and it will allow me to do things that just weren't very feasible with Python.
If anyone has any questions, feel free to ask -- I will be checking e-mails as much as possible and will try to respond as quickly as I can.
Thanks for your support -- all of you are amazing and I truly appreciate the donations and support that you have given me and this project!
All the best,
Jason
Posted 5 years ago