[Update] All indices have been recovered
Author: Stuck_In_the_Matrix

All submission and comment indices have been recovered. All data for all time periods should now be available.

For future reference, when querying Pushshift, it's a good idea to always add the metadata parameter (metadata=true) to ensure that all shards are returning data and that no shard data is missing from the response.

Here is what the appropriate section within the metadata key looks like:


    "shards": {
        "failed": 0,
        "skipped": 0,
        "successful": 4,
        "total": 4
    },

You should first make sure that failed is equal to 0. Also, make sure that the number of successful shards is equal to the total. It is possible for failed to return 0 but for the other two numbers not to match if a node falls out of the cluster.

If you check these two conditions, you will know whether you are getting partial results or all available results.
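As a rough illustration of the two checks above, here is a minimal Python sketch. It assumes the response has already been parsed into a dict and that the shard counts sit under `metadata` → `shards`, as in the snippet shown earlier. The function name `shards_complete` is made up for this example and is not part of any Pushshift client.

```python
def shards_complete(response: dict) -> bool:
    """Return True only if every shard contributed to the response."""
    shards = response.get("metadata", {}).get("shards")
    if shards is None:
        # No shard section at all: the request was probably sent without metadata=true.
        raise ValueError("no shard metadata in response; add metadata=true to the query")
    # Condition 1: no shard reported a failure.
    # Condition 2: the number of successful shards equals the total.
    return shards["failed"] == 0 and shards["successful"] == shards["total"]


# Sample payload built from the shard section shown in the post.
example = {
    "metadata": {"shards": {"failed": 0, "skipped": 0, "successful": 4, "total": 4}},
    "data": [],
}
print(shards_complete(example))
```

If a node has dropped out, `failed` can still be 0 while `successful` is lower than `total`, which is why the function checks both conditions.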

If the successful shard count is less than the total shard count, what probably happened is that a node fell out of the cluster. This is usually temporary.

I am going to reach out to the maintainer of PRAW and request a parameter that checks the shard counts and either fails the request entirely or gives the user a warning. Off the top of my head, the parameter would be something like "allow_partial", where the user can set allow_partial=False to make the call fail entirely when any shards are unavailable.
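Until something like that exists in a client library, the same behaviour can be approximated on the caller's side. The sketch below is only a guess at how such an `allow_partial` switch might behave, not an existing API. The names `search`, `http_get_json` and `PartialResultsError` are invented for this example, and the endpoint URL is left to the caller.

```python
import json
import warnings
from urllib.parse import urlencode
from urllib.request import urlopen


class PartialResultsError(RuntimeError):
    """Raised when some shards did not contribute (hypothetical name)."""


def http_get_json(url: str) -> dict:
    # Plain standard-library GET; swap in your own HTTP client if preferred.
    with urlopen(url, timeout=30) as resp:
        return json.load(resp)


def search(endpoint: str, params: dict, allow_partial: bool = True,
           fetch=http_get_json) -> dict:
    # Always request metadata so the shard counts are present in the response.
    query = dict(params, metadata="true")
    payload = fetch(endpoint + "?" + urlencode(query))

    shards = payload.get("metadata", {}).get("shards", {})
    failed = shards.get("failed")
    successful = shards.get("successful")
    total = shards.get("total")

    # Complete only if nothing failed and every shard answered.
    complete = failed == 0 and successful is not None and successful == total
    if not complete:
        message = (f"partial results: {successful} of {total} shards "
                   f"successful, {failed} failed")
        if not allow_partial:
            raise PartialResultsError(message)
        warnings.warn(message)
    return payload
```

With `allow_partial=False` the call raises as soon as the shard counts do not line up. With the default it returns whatever data arrived and emits a warning. Missing shard metadata is treated as incomplete, which is the cautious choice.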

Let me know if you have any questions. In the future, I am going to examine methods to automatically rejoin a node that fell out of the cluster. What typically happens is that someone will run a very expensive query and one of the nodes will take an extremely long time to return results to the master node. If this happens, the master node will assume that the other node took a vacation and will mark that node as being unavailable.

The cluster in production is running Elasticsearch 5.6, which is an older version. In the next few weeks, I'm going to attempt to upgrade all nodes to 6.x and then to 7.x.

When that happens, I'll put out an announcement since there will likely be a little downtime involved in that upgrade.

Posted
5 years ago