All submission and comment indices have been recovered. All data for all time periods should now be available.
For future reference, when querying Pushshift, it's a good idea to always add the metadata parameter (metadata=true) to ensure that all shards are returning data and that no shard data is missing from the response.
Here is what the appropriate section within the metadata key looks like:
    "shards": {
        "failed": 0,
        "skipped": 0,
        "successful": 4,
        "total": 4
    },
You should first make sure that failed is equal to 0. Also, make sure that the number of successful shards is equal to the total. It is possible for failed to return 0 but for the other two numbers not to match if a node falls out of the cluster.
If you check these two conditions, you will know for certain whether you are getting all available results or only partial results.
If the successful shard count is less than the total shard count, what probably happened is that a node fell out of the cluster. This is usually temporary.
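For anyone who wants to script the check above, here is a minimal sketch in Python. It assumes the shards section sits under the metadata key of the parsed JSON response, as shown earlier. The endpoint URL and the helper names (build_url, shards_complete) are my own placeholders for illustration, not part of any official client.

```python
from urllib.parse import urlencode

# Assumed endpoint for illustration; substitute the search endpoint you query.
BASE_URL = "https://api.pushshift.io/reddit/search/submission/"


def build_url(**params):
    """Return a search URL that always asks for shard metadata."""
    params["metadata"] = "true"
    return BASE_URL + "?" + urlencode(params)


def shards_complete(response_json):
    """Return True only if no shard failed and every shard answered."""
    shards = response_json.get("metadata", {}).get("shards")
    if not shards:
        # No shard information, so completeness cannot be confirmed.
        return False
    failed = shards.get("failed")
    successful = shards.get("successful")
    total = shards.get("total")
    # Both conditions matter: failed can be 0 while successful < total.
    return failed == 0 and total is not None and successful == total
```

You would fetch build_url(subreddit="pushshift") with whatever HTTP client you like, parse the JSON, and retry later if shards_complete returns False.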
I am going to reach out to the maintainer of PRAW and request that a parameter be added that checks the shard counts and will either fail the request entirely or give a warning to the user. Off the top of my head, the parameter would be something like "allow_partial" where the user can set allow_partial=False if they want the call to fail entirely if any shards are unavailable.
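To make the idea concrete, here is a rough sketch of how an allow_partial switch could behave. This is purely hypothetical: no such parameter exists yet, and the function name check_shards and the PartialResultsError class are invented for this example. It works on an already parsed JSON response, so it does not depend on any particular client library.

```python
import warnings


class PartialResultsError(RuntimeError):
    """Hypothetical error raised when shard data is missing."""


def check_shards(response_json, allow_partial=True):
    """Return the response, warning or failing if shards are missing."""
    shards = response_json.get("metadata", {}).get("shards", {})
    failed = shards.get("failed")
    successful = shards.get("successful")
    total = shards.get("total")

    complete = failed == 0 and total is not None and successful == total
    if complete:
        return response_json

    message = (
        f"partial results: {successful} of {total} shards succeeded, "
        f"{failed} failed"
    )
    if not allow_partial:
        # Caller asked for all-or-nothing, so fail the request entirely.
        raise PartialResultsError(message)
    # Default: hand back what we have, but tell the user it is incomplete.
    warnings.warn(message)
    return response_json
```

With allow_partial=False a response with a missing shard raises instead of silently returning incomplete data; with the default you still get the data plus a warning.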
Let me know if you have any questions. In the future, I am going to examine methods to auto-join a node that fell out of the cluster. What typically happens is that someone will run a very expensive query and one of the nodes will take an extremely long time to return results back to the master node. If this happens, the master node will assume that the other node took a vacation and will mark that node as being unavailable.
The production cluster is running Elasticsearch 5.6, which is an older version. In the next few weeks, I'm going to attempt to upgrade all nodes to 6.x and then to 7.x.
When that happens, I'll put out an announcement since there will likely be a little downtime involved in that upgrade.