Someone fired up 50 EC2 instances (or they were using Lambda functions) and started hammering Pushshift with queries related to gaming laptops. It looks like they just wanted to pull the history of comments mentioning certain gaming laptop terms very quickly.
Now, normally Pushshift gets 25-50 requests per second (sometimes up to 100 during busy periods), and the aggregate egress bandwidth from the API server is usually around 5 MB/s. Your queries increased our egress bandwidth to over 50 MB/s (and that was AFTER compression). Amazingly, the API handled the load, although average latency rose to over a second for most responses.
Now, as a data hoarder, I get it -- sometimes you want data and you want it now. But please be mindful of the other clients that are respecting the rate limits. I generally don't care if someone uses a few extra IPs to increase their rate limit because in the end, it isn't really that big of a deal -- but if more people did what you did, it would cause the API to start choking on that load.
I generally don't blacklist IPs, for numerous reasons, but I do occasionally blacklist IPs temporarily if they keep hammering the API even after receiving 429 errors (rate limit exceeded). I will also ban IPs that are obviously making malicious attempts to bring the API down. However, I try not to do this because I'm a big fan of open source and sharing data, and I get it when a researcher needs to get data quickly.
In the future, if you need to grab data in bulk, you can download the monthly dumps or e-mail me and we can come up with an alternate plan. If you use a few extra IPs, I don't really care -- but 50 is excessive.
Just keep in mind this tool is for the community and when you grossly exceed the rate limit, you're causing others to suffer from increased latency and general slowness.
Thank you!
PS: Once the new cluster comes online, I will probably increase the rate limit to 2-5 queries per second for everyone.
One more thing -- if you are using a script to fetch data from Pushshift, please check the response code, and if you see a 429, add a one-second sleep or a progressive backoff scheme. If you need help with the code, I'm happy to share code examples in Python.
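For anyone who wants a starting point, here is a minimal sketch of the "check for 429, then back off" idea using only the Python standard library. It is an illustration, not the official Pushshift client: the function name `fetch_with_backoff`, its parameters, and the doubling schedule (1s, 2s, 4s, ... capped at `max_delay`) are all assumptions chosen for the example.

```python
import time
import urllib.error
import urllib.request


def fetch_with_backoff(url, max_retries=5, base_delay=1.0, max_delay=60.0,
                       opener=urllib.request.urlopen, sleep=time.sleep):
    """Fetch `url`, sleeping and retrying whenever the server answers 429.

    Hypothetical helper: the name, defaults, and doubling schedule are
    assumptions for illustration. `opener` and `sleep` are injectable so the
    retry logic can be exercised without touching the network.
    """
    delay = base_delay
    for attempt in range(max_retries + 1):
        try:
            # urlopen raises HTTPError for 4xx/5xx responses, including 429.
            with opener(url) as response:
                return response.read()
        except urllib.error.HTTPError as err:
            # Only rate-limit responses are retried; give up once retries run out.
            if err.code != 429 or attempt == max_retries:
                raise
        # Progressive backoff: wait, then double the wait for the next attempt.
        sleep(delay)
        delay = min(delay * 2, max_delay)
```

You would call it as `body = fetch_with_backoff(url)` with whatever query URL your script already uses. If you only want the simple version, passing `max_delay=1.0` keeps every wait at one second.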
- Posted
- 3 years ago
- External URL
- reddit.com/r/pushshift/c...