
To the person who spun up 50+ Amazon EC2 servers to evade the Pushshift rate limit -- please think of other clients

Someone fired up 50 EC2 instances (or they were using Lambda functions) and started hammering Pushshift with queries related to gaming laptops. It looks like they just wanted to get the history of comments mentioning certain gaming laptop terms very quickly.

Now, normally Pushshift gets between 25 and 50 requests per second (sometimes up to 100 during busy periods) and the aggregate egress bandwidth from the API server is usually around 5 MB/s. Your queries increased our egress bandwidth to over 50 MB/s (and that was AFTER compression). Amazingly, the API supported the load, although average latency increased to over a second for most responses.

Now, as a data hoarder, I get it -- sometimes you want data and you want it now. But please be mindful of the other clients that are respecting the rate limits. I generally don't care if someone uses a few extra IPs to increase their rate limit because in the end, it isn't really that big of a deal -- but if more people did what you did, it would cause the API to start choking on that load.

I generally avoid blacklisting IPs for numerous reasons, but I do occasionally put a temporary blacklist on IPs that continuously hammer the API even after receiving 429 errors (rate limit exceeded errors). I will also ban IPs that are making obviously malicious attempts to bring the API down. However, I try not to do this because I'm a big fan of open source and sharing data, and I get it when a researcher needs to get data quickly.

In the future, if you need to grab data in bulk, you can download the monthly dumps or e-mail me and we can come up with an alternate plan. If you use a few extra IPs, I don't really care -- but 50 is excessive.

Just keep in mind this tool is for the community and when you grossly exceed the rate limit, you're causing others to suffer from increased latency and general slowness.

Thank you!

P.S. Once the new cluster comes online, I will probably increase the rate limit to 2-5 queries per second for everyone.

One more thing -- if you are using a script to fetch data from Pushshift, please check the response code and, if you see a 429, please add a one-second sleep or a progressive backoff scheme. If you need help with the code, I'm happy to share code examples in Python.
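
For anyone who wants a starting point, here is a minimal sketch of that 429 handling, using only the Python standard library. It is one possible way to do it, not an official Pushshift client: the function names (`http_get`, `fetch_with_backoff`) and the parameters (`max_retries`, `base_delay`, `max_delay`) are made up for illustration, and the doubling delay is just one common choice of progressive backoff.

```python
import time
import urllib.error
import urllib.request


def http_get(url):
    """Return (status_code, body_bytes) for a GET, without raising on HTTP errors."""
    try:
        with urllib.request.urlopen(url, timeout=30) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx responses; convert that back to a status code.
        return err.code, err.read()


def fetch_with_backoff(url, get=http_get, sleep=time.sleep,
                       max_retries=5, base_delay=1.0, max_delay=60.0):
    """Fetch url, pausing and retrying whenever the server answers 429.

    All parameter names and defaults here are illustrative assumptions.
    """
    delay = base_delay
    for attempt in range(max_retries + 1):
        status, body = get(url)
        if status != 429:
            # Not rate limited: hand the response back to the caller.
            return status, body
        if attempt == max_retries:
            break
        sleep(delay)                       # first pause is one second
        delay = min(delay * 2, max_delay)  # then 2s, 4s, 8s, ... up to max_delay
    raise RuntimeError(f"still rate limited after {max_retries} retries: {url}")
```

Calling `fetch_with_backoff(some_url)` with whatever query URL you already use returns the status code and raw body once the server stops answering 429. The `get` and `sleep` arguments exist only so the retry logic can be tested without touching the network.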

Posted
3 years ago