Pushshift will be in a degraded state for at least a few days (Details inside)
Something very unfortunate happened while I was doing maintenance. One server in the cluster has gone down often, so I was in the process of replacing it with a new one and was getting ready to move the data. Unfortunately, I ran a command on the origin server that was meant for the destination server (basically, I had switched to the wrong window).

This event caused the loss of approximately 20% of the cluster's data (spread across 2006 through late 2018). Data for 2019 is not affected and is still complete (recent data is replicated across more than one node for redundancy).

The raw data itself is backed up in multiple locations, which means no data was permanently lost. However, it will take time to re-index the data, bring the new node online, and reconcile / merge it with the existing shards that are still operational.

This process will most likely take at least a few days and may extend up to a week. One of the major ongoing issues that has plagued this project is a lack of redundant nodes, which would have prevented running in a degraded state. Eventually, I will have more servers to address this, and issues like this will be far less likely to occur.

If you are using the API for research, you will need to stop data collection until the data is restored. You can of course still download the raw data and use that for the time being.

A huge apology to everyone -- it was a horrible screw-up. I tried my best to recover the data, but unfortunately recovering deleted files from an EXT4 filesystem is virtually impossible if data was written to the drive after the files were erased.

In a way, this is a mixed blessing because this will give me the opportunity to do a lot of reorganization and bring the new (faster) node online to replace the node that kept failing.

If you have any questions, feel free to ask them in the comments. I'm going to grab some sleep -- it was a hell of a ride over the past 2 hours.

To recap:

What doesn't work:

1) The data will remain incomplete until it has been restored to the new node.

2) There will be comments and submissions missing pre-2019.

3) Aggregations (including timeline aggregations) will underreport activity for certain time ranges.

What does work:

1) The API is still ingesting new data without issues.

2) The API is still responding to requests.

3) Data is complete for this year (for both submissions and comments).
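
For anyone running collection scripts in the meantime, the recap above can be turned into a simple guard. This is only a minimal sketch: it assumes that everything from 2019-01-01 UTC onward is complete and that anything earlier may have gaps, which is how the post describes the situation. The names `COMPLETE_FROM`, `is_range_affected`, and `complete_part` are hypothetical and are not part of the Pushshift API.

```python
from datetime import datetime, timezone

# Assumption taken from the post: 2019 data is complete, while data from
# 2006 through late 2018 may be missing until the restore finishes.
COMPLETE_FROM = datetime(2019, 1, 1, tzinfo=timezone.utc)


def is_range_affected(start: datetime, end: datetime) -> bool:
    """Return True if the range [start, end) includes any pre-2019 time.

    Both arguments must be timezone-aware datetimes.
    """
    if start >= end:
        raise ValueError("start must be earlier than end")
    return start < COMPLETE_FROM


def complete_part(start: datetime, end: datetime):
    """Return the (start, end) portion of the range that is complete.

    Returns None when the whole range falls before COMPLETE_FROM.
    """
    if start >= end:
        raise ValueError("start must be earlier than end")
    if end <= COMPLETE_FROM:
        return None
    return (max(start, COMPLETE_FROM), end)
```

A script could call `is_range_affected` before running an aggregation and either skip the query or fall back to the raw data dumps when it returns True.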

Posted
5 years ago