Growing pains and moving forward to bigger and better performance

Let me start off by saying that when I first started out, I honestly never anticipated that the Pushshift API would grow to see up to 400 million hits a month. I anticipated growth, but not at the level the API has seen over the past few years.

Lately, the production API has just about reached its limits, both in the number of requests it receives and in the size of the data held in the current cluster. Believe me, 5xx errors and occasional data gaps frustrate me as much as they do everyone else who depends on the API for accurate and reliable data.

The current production API runs an older version of Elasticsearch, and the number of nodes in the cluster isn't sufficient to keep up with demand. That is unacceptable to me because I want people to be able to depend on the API for accurate data.

I have rewritten the ingest script to be far more robust than the current one feeding the production API (the new ingest script is feeding http://beta.pushshift.io/redoc). This is the current plan going forward:

1) I'll be adding 5-10 additional nodes (servers) to the cluster to bring the cluster up to around 16 nodes in total. The new cluster will have at least one replica shard for each primary shard. What that means is that if there is a node failure, the API will still return complete results for a query.

2) The new ingest script will be put into production to feed data into the new cluster. There will also be better monitoring scripts to verify the integrity and completeness of the data. With the additional logic in the new ingest script and the methodology it uses to collect data, data gaps should only occur if there is some unforeseen bug / error with Elasticsearch indexing (which there really shouldn't be). In the event that a data gap is found, the monitor script will detect it and correct it.

3) The index methodology will create a new index for each new calendar month. I'll incorporate additional logic in the API to only scan the indexes needed for a particular query that restricts a search by time. This will increase performance because Elasticsearch won't have to touch shards that don't contain data within the time range searched.

4) I'll be creating a monitor page that people can visit to see the overall health of the cluster. If there are any known problems, the monitor page will list them along with an estimate of how long each will take to fix.

5) Removal requests will be made easier by allowing users who still have an active Reddit account to simply log in via their Reddit account to prove ownership and then be given the ability to remove their data from the cluster. This will automate and speed up removal requests for users who are concerned about their privacy. This page will also allow a user to download all of their comments and posts if they choose to do so before removing their data.
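
To illustrate item 3, here is a minimal sketch of how an API layer could work out which monthly indexes a time-restricted query needs. It is not the actual Pushshift code: the `rc-YYYY-MM` naming scheme, the `INDEX_PREFIX` constant and the `monthly_indexes` function are all assumptions made up for this example.

```python
from datetime import datetime
from typing import List

# Hypothetical prefix; the post does not say how the monthly indexes are named.
INDEX_PREFIX = "rc"


def monthly_indexes(after: datetime, before: datetime,
                    prefix: str = INDEX_PREFIX) -> List[str]:
    """Return one index name for every calendar month touched by [after, before]."""
    if after > before:
        raise ValueError("'after' must not be later than 'before'")
    names = []
    year, month = after.year, after.month
    # Walk month by month until we pass the month containing 'before'.
    while (year, month) <= (before.year, before.month):
        names.append(f"{prefix}-{year:04d}-{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return names


if __name__ == "__main__":
    wanted = monthly_indexes(datetime(2020, 11, 15), datetime(2021, 1, 3))
    # Elasticsearch accepts a comma-separated list of index names in a search path.
    print(",".join(wanted))
```

A query restricted to mid-November 2020 through early January 2021 would then be sent only to the three indexes covering those months, so shards holding other months are never touched.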

When we start the process of upgrading the cluster and moving / re-indexing data into the new cluster, there may be a window of time where old data is unavailable until all the data has been migrated. When that time comes, I'll let everyone know about the situation and what to expect. The goal is to make the transition as painless as possible for everyone.

Also, we will soon be introducing keys for users so that we can better track usage and make sure that no one person makes so many expensive requests that it starts to hurt the performance of the cluster. When that time comes, I'll make another post explaining the process of signing up for a key and how to use the key to make requests.
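
As a rough idea of how per-key limits could keep one user from overloading the cluster, here is a small token-bucket sketch. The post does not describe the real key system, so everything here is hypothetical: the `KeyBudget` class, the notion of a per-request `cost`, and the capacity and refill numbers.

```python
import time


class KeyBudget:
    """Token bucket per API key.

    Each key can spend up to `capacity` cost units at once, and the
    budget refills at `rate` units per second.
    """

    def __init__(self, capacity: float, rate: float, clock=time.monotonic):
        self.capacity = capacity
        self.rate = rate
        self.clock = clock          # injectable so the logic can be tested
        self._state = {}            # key -> (tokens_left, last_seen_time)

    def allow(self, key: str, cost: float = 1.0) -> bool:
        """Return True and charge the key if it can afford `cost`."""
        now = self.clock()
        tokens, last = self._state.get(key, (self.capacity, now))
        # Refill for the time elapsed since the key was last seen.
        tokens = min(self.capacity, tokens + (now - last) * self.rate)
        if cost > tokens:
            self._state[key] = (tokens, now)
            return False
        self._state[key] = (tokens - cost, now)
        return True


if __name__ == "__main__":
    budget = KeyBudget(capacity=10, rate=1)
    print(budget.allow("example-key", cost=8))   # a heavy aggregation
    print(budget.allow("example-key", cost=5))   # refused until the budget refills
```

Charging a higher `cost` for expensive queries than for simple lookups is one way to limit load without capping the raw request count.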

As always, I appreciate all the feedback from users. I currently don't spend much time on Reddit, but you can e-mail or ping me via Twitter if needed. Again, I appreciate any alerts from people who discover issues with the API.

Thanks to everyone who currently supports Pushshift. I hope to get all of the above completed before the new year. We will also be adding data sources and new API endpoints that researchers can use to collect social media data from sources other than Reddit.

Thank you and please stay safe and healthy over the holidays!

Jason

Author

Stuck_In_the_Matrix

Subreddit

r/pushshift

Posted: 4 years ago