[Update] All indices have been recovered

All submission and comment indices have been recovered. All data for all time periods should now be available.

For future reference, when querying Pushshift, it's a good idea to always add the metadata parameter (metadata=true) to ensure that all shards are returning data and that no shard data is missing from the response.

Here is what the appropriate section within the metadata key looks like:

    "shards": {
        "failed": 0,
        "skipped": 0,
        "successful": 4,
        "total": 4
    },

You should first make sure that failed is equal to 0. Also, make sure that the number of successful shards is equal to the total. It is possible for failed to return 0 but for the other two numbers not to match if a node falls out of the cluster.

If you check both of these conditions, you will know whether you are getting all available results or only partial results.
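The two checks above can be sketched as a small helper. This is only an illustration: the function name `shards_complete` is made up for this example, and it assumes you have already parsed the JSON response and pulled out the value of the metadata key as a Python dict.

```python
def shards_complete(metadata):
    """Return True only if no shard failed and every shard responded."""
    shards = metadata.get("shards", {})
    # Treat a missing shards section as incomplete rather than guessing.
    if not shards:
        return False
    return (
        shards.get("failed") == 0
        and shards.get("successful") == shards.get("total")
    )


# The shard counts shown earlier in this post: all four shards answered.
healthy = {"shards": {"failed": 0, "skipped": 0, "successful": 4, "total": 4}}

# Hypothetical case where a node fell out of the cluster:
# nothing is reported as failed, but only 3 of 4 shards answered.
degraded = {"shards": {"failed": 0, "skipped": 0, "successful": 3, "total": 4}}

print(shards_complete(healthy))   # True
print(shards_complete(degraded))  # False
```

Note that the second case is the one that checking failed alone would miss.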

If the successful shard count is less than the total shard count, what probably happened is that a node fell out of the cluster. This is usually temporary.

I am going to reach out to the maintainer of PRAW and request that a parameter be added that checks the shard counts and will either fail the request entirely or give a warning to the user. Off the top of my head, the parameter would be something like "allow_partial" where the user can set allow_partial=False if they want the call to fail entirely if any shards are unavailable.
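Until something like that exists in a client library, here is a rough sketch of how such a check could behave on your side. Everything here is hypothetical: `check_response`, `PartialResultsError`, and the `allow_partial` argument are names invented for this example, not part of any existing library. It also assumes the parsed response is a dict with the results under a "data" key and the shard counts under the metadata key.

```python
import warnings


class PartialResultsError(RuntimeError):
    """Raised when some shards did not contribute to the response."""


def check_response(response, allow_partial=True):
    """Return the results, warning or failing if shard data is missing.

    `response` is the parsed JSON of a request made with metadata=true.
    """
    shards = response.get("metadata", {}).get("shards", {})
    complete = (
        bool(shards)
        and shards.get("failed") == 0
        and shards.get("successful") == shards.get("total")
    )
    if complete:
        return response.get("data", [])

    message = "Partial results: shard counts were {}".format(shards)
    if not allow_partial:
        # The caller asked for all-or-nothing behavior.
        raise PartialResultsError(message)
    # Otherwise hand back what we have, but make the gap visible.
    warnings.warn(message)
    return response.get("data", [])
```

With allow_partial=False the call fails outright when any shard is missing. With the default it still returns the data but emits a warning, so partial results do not pass silently.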

Let me know if you have any questions. In the future, I am going to examine methods to automatically rejoin a node that has fallen out of the cluster. What typically happens is that someone runs a very expensive query and one of the nodes takes an extremely long time to return results to the master node. When this happens, the master node assumes that the slow node took a vacation and marks it as unavailable.

The production cluster is currently running Elasticsearch 5.6, which is an older version. In the next few weeks, I'm going to attempt to upgrade all nodes to 6.x and then to 7.x.

When that happens, I'll put out an announcement since there will likely be a little downtime involved in that upgrade.

Author

Stuck_In_the_Matrix

Subreddit

r/pushshift

Posted: 5 years ago