Many of you who use the ingest for near real-time work have probably noticed that the stream will fall behind by up to half an hour or more (comments, submissions, or both), so I wanted to talk about this and possible paths forward.
The ingest script previously had issues keeping up with the volume of Reddit posts and comments, so a second ingest account was added to give a larger pool of API requests during any particular rate-limit window.
This appeared to work for a while. Although there are two accounts giving twice the request budget per rate-limit window for reads, requests are still sent out serially, so if Reddit's API takes on average more than a second to respond, the ingest will fall behind no matter how large a pool of requests it has.
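To put rough numbers on this, here is a minimal sketch. It assumes, as implied above, that the ingest must complete about one request per second to keep pace. The function name `backlog_after` and the 1.25-second latency are made up for illustration and are not measurements.

```python
def backlog_after(seconds_elapsed, avg_latency_s, required_requests_per_s=1.0):
    """Estimate how many requests a strictly serial ingest is behind.

    Assumes only one request is in flight at a time, so throughput is
    1 / latency regardless of how many accounts share the rate limit.
    """
    achieved_per_s = 1.0 / avg_latency_s
    deficit_per_s = max(0.0, required_requests_per_s - achieved_per_s)
    return deficit_per_s * seconds_elapsed


# Hypothetical case: responses average 1.25 s, so only 0.8 requests/s complete.
behind_after_one_hour = backlog_after(seconds_elapsed=3600, avg_latency_s=1.25)

# Hypothetical case: responses average 0.5 s, so a serial loop keeps up.
behind_when_fast = backlog_after(seconds_elapsed=3600, avg_latency_s=0.5)
```

Under those assumptions the serial ingest ends up roughly 720 requests behind after an hour, and adding more accounts does not change that, because the limit is latency rather than request budget.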
There is really only one way to fix this at this point -- moving the ingest to a concurrent request scheme. Unfortunately, the logic to do that gets a bit messy if one wants to keep a set of assurances to the end-user. One of these assurances in the past was that comments are added to the API in chronological order -- meaning if you ask for the latest X comments, you are guaranteed that no new comments will find their way into the time period you already asked for (at least, the time period up to the max created_utc).
In order to get concurrency working correctly and still give this assurance, comments and submissions have to be collected and indexed in such a way as to basically be atomic blocks of data committed in chronological order. Let me give an example of what this means and the challenges faced while doing this.
As I mentioned previously, the ingest script will (in a simplified explanation) ask for submissions 1-20 and comments 1-80 in one call, wait for the data, and then, when the call returns, take the max seen id for submissions and comments as a marker for where to begin the next request.
Step one: Ingest asks for submissions 1-20 and comments 1-80 (ids)
Step two: Ingest receives 13 submissions with a max id of 15 and 72 comments with a max id of 74 (some ids are never available due to being part of a private sub, etc.)
Step three: Ingest asks for submissions 16-35 and comments 75-154 .... (and so on)
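The serial steps above can be sketched roughly as follows. This is not the actual ingest code: `fetch(lo, hi)` is a hypothetical stand-in for the Reddit API call, assumed to return whichever ids in the range actually exist, and `next_range` and `serial_ingest` are names invented for this example.

```python
def next_range(max_seen_id, batch_size):
    """Return the inclusive id range to request after max_seen_id."""
    start = max_seen_id + 1
    return start, start + batch_size - 1


def serial_ingest(fetch, start_id, batch_size, rounds):
    """Request ids block by block, advancing past the highest id returned.

    `fetch(lo, hi)` is a hypothetical callable returning the sorted ids that
    exist in [lo, hi]; unavailable ids (private subs, etc.) are just absent.
    """
    collected = []
    lo, hi = start_id, start_id + batch_size - 1
    for _ in range(rounds):
        ids = fetch(lo, hi)
        collected.extend(ids)
        # If nothing came back, ask for the same range again next round.
        if ids:
            lo, hi = next_range(max(ids), batch_size)
    return collected


# Toy data standing in for ids that exist on the remote side.
existing = {1, 2, 4, 7, 9, 12}


def fake_fetch(lo, hi):
    return sorted(i for i in existing if lo <= i <= hi)


seen = serial_ingest(fake_fetch, start_id=1, batch_size=5, rounds=3)
```

For the comment stream in the steps above, `next_range(74, 80)` gives `(75, 154)`: the next request starts one past the highest id seen, not one past the highest id asked for.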
In a perfect spamless world where APIs return data nearly instantly, this method is very robust and works over a long time-span without too many tweaks. With a concurrent model using two threads, it changes a bit:
Step one: Ingest worker #1 asks for submissions 1-20 and comments 1-80.
Step two: Ingest worker #2 asks for submissions 21-40 and comments 81-160.
At this point, a number of things can happen. Ideally the data would be returned in order, but that is something you cannot rely on in a concurrent model. It's just as likely that worker #2 will return first, in which case the program needs to wait for worker #1 to complete before putting the entire block of comments and submissions in order and making an atomic commit to Elasticsearch.
What happens if worker #1 has a long delay or fails completely? Then we hold onto the data that worker #2 retrieved, fire off another batch of concurrent requests where worker #1 asks again for the same ids while worker #2 continues advancing forward.
All of this requires logic to know when a full block has completed successfully and then placing that data into Redis as one uninterrupted block (chronologically). The ingest can't simply put worker #2's data into Redis right away because once data objects are placed in Redis, there is a processing script that constantly pulls the data out of Redis to do the actual index operations, so anything placed there early would be indexed ahead of the older data it is still waiting on.
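Here is one way the hold-and-retry logic described above might look, as a sketch under stated assumptions rather than the real ingest. `fetch(lo, hi)` and `commit(items)` are hypothetical callables: `commit` stands in for the single push of a completed block into Redis. `ingest_concurrently`, `flaky_fetch`, and the tiny id ranges are invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def ingest_concurrently(fetch, ranges, commit, workers=2, max_rounds=10):
    """Fetch consecutive id ranges in parallel, commit them strictly in order."""
    held = {}       # range index -> items fetched but not yet committable
    next_index = 0  # lowest range index that has not been committed
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(max_rounds):
            if next_index >= len(ranges):
                break
            # Lowest missing ranges go first: a failed range is retried
            # while the remaining worker keeps advancing.
            missing = [i for i in range(next_index, len(ranges)) if i not in held]
            futures = {i: pool.submit(fetch, *ranges[i]) for i in missing[:workers]}
            for i, future in futures.items():
                try:
                    held[i] = future.result()
                except Exception:
                    pass  # leave range i missing so the next round retries it
            # Release the contiguous run of finished ranges as one block.
            block = []
            while next_index in held:
                block.extend(held.pop(next_index))
                next_index += 1
            if block:
                commit(block)
    return next_index == len(ranges)


# Demo: the first request for ids 1-4 fails once, then succeeds on retry.
attempts = {}


def flaky_fetch(lo, hi):
    attempts[(lo, hi)] = attempts.get((lo, hi), 0) + 1
    if (lo, hi) == (1, 4) and attempts[(lo, hi)] == 1:
        raise TimeoutError("simulated slow response")
    return list(range(lo, hi + 1))


committed = []
finished = ingest_concurrently(flaky_fetch, [(1, 4), (5, 8), (9, 12)], committed.append)
```

In the demo, the data for ids 5-8 is held back during the first round and only released, together with ids 1-4 and 9-12, once the retry succeeds. A real version would also need a cap on how much data can be held while an early range keeps failing.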
There are a few different methods of interleaving parallel requests -- doing progressive blocks like the previous example, having one worker ask for even numbered ids while the other asks for odd numbered, etc. The idea is to follow KISS as much as possible ("Keep it simple, stupid") and for this, I feel that using progressive blocks is the method that would require the least amount of logic.
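The two ways of splitting ids between workers can be compared with a small sketch. Both helper names are hypothetical and the id counts are kept tiny for readability.

```python
def progressive_blocks(start_id, block_size, workers):
    """Give each worker one contiguous run of ids."""
    return [
        list(range(start_id + w * block_size, start_id + (w + 1) * block_size))
        for w in range(workers)
    ]


def interleaved_ids(start_id, total_ids, workers):
    """Give worker w every `workers`-th id (odds and evens for two workers)."""
    ids = range(start_id, start_id + total_ids)
    return [list(ids[w::workers]) for w in range(workers)]


blocks = progressive_blocks(start_id=1, block_size=4, workers=2)
striped = interleaved_ids(start_id=1, total_ids=8, workers=2)
```

With progressive blocks, putting results back in order is just concatenating the workers' output in worker order. With interleaving, the results have to be merged by id, which is part of why progressive blocks need less logic.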
In any event, I hope this gave you some insight into the problem. Also, if you have a better method that I have completely overlooked, I'm happy to discuss -- that's what makes communities like ours great!
Keeping the ingest latency low (under 5 seconds) is extremely important because part of the value of the Pushshift API is having access to near real-time data. With that said, I hope to have changes tested and completed within the next week.