[2019-07-14 23:30 ET] I am now loading in all submissions from 2010 and earlier. I expect those submissions to be recovered later tonight. I'll update this submission with occasional status updates on the recovery process. There is some good news in this: the updated data will have more complete JSON data for very old submissions!
Since I have to reload the data, I may as well make one improvement. I'll be adding the author id and author created_utc to submissions so that submissions can be filtered based on when the author's account was created.
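To illustrate what filtering on the author's account age could look like once the field is present, here is a minimal client-side sketch. It assumes each submission is a plain dict carrying `author_created_utc` as epoch seconds; the sample records, the `id` and `author` keys, and the helper names are hypothetical and are not part of the API.

```python
from datetime import datetime, timezone

def to_epoch(year, month, day):
    """Convert a UTC calendar date to epoch seconds."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())

def authors_created_before(submissions, cutoff_epoch):
    """Return submissions whose author account predates cutoff_epoch.

    Records with no author_created_utc value (assumed here to be
    possible, e.g. for deleted accounts) are skipped.
    """
    return [
        s for s in submissions
        if s.get("author_created_utc") is not None
        and s["author_created_utc"] < cutoff_epoch
    ]

# Hypothetical sample records; only author_created_utc is named in the post.
sample = [
    {"id": "a1", "author": "alice", "author_created_utc": to_epoch(2009, 5, 1)},
    {"id": "b2", "author": "bob", "author_created_utc": to_epoch(2018, 2, 3)},
    {"id": "c3", "author": "[deleted]", "author_created_utc": None},
]

old_accounts = authors_created_before(sample, to_epoch(2012, 1, 1))
print([s["id"] for s in old_accounts])
```

The same comparison could presumably be done server-side once the field is indexed, but the exact query parameter is not stated here, so this sketch filters locally.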
[2019-07-15 21:35 ET] Submissions for the years 2005 thru 2012 (inclusive) have been successfully reloaded. I am merging the segments now and will have the new index up so that it is available via the API in the next hour. Submissions for 2013 and 2014 are processing now and should be completed by tomorrow. I made some optimizations when rebuilding the indexes so hopefully the API is more responsive once the reload of data is complete.
[2019-07-15 22:30 ET] Submissions for years 2005 thru 2012 (inclusive) are now merged and available via the API.
[2019-07-16 13:00 ET] Submissions for the year 2013 have been re-indexed and are now available. Submissions for 2014 should be completed in a few hours. Submissions for 2015 and 2016 should be completed by tomorrow afternoon.
[2019-07-16 18:55 ET] Submissions for the year 2014 have been re-indexed and are now available. Submissions for 2015 should be comp
[2019-07-17 01:50 ET] Submissions for the year 2015 have been re-indexed.
[2019-07-17 02:15 ET] Submissions for 2016 are being re-indexed. I expect 2016 to be complete in around 4-5 hours. Although 2016 was not affected, I am still loading in the author_created_utc information. The only gap left for submissions is 2017-01-01 through 2017-08-01 (inclusive). I have started reloading that range concurrently with the 2016 re-index and expect it to be completed in around 8-10 hours. While looking at the data, I noticed that the submissions for 2018 need to have their score and gildings updated. Since I already have the data from the monthly dumps, I will also process 2018 to update the data and include the author_created_utc information. I expect submissions to be completed by late Friday / early Saturday. At that point, I will move to comments. There is also a lot of score data that needs to be updated for comments. I anticipate all comments to be restored / updated by the following weekend.
(Submissions for 2016 were not affected but I am going to reprocess them to add the author_created_utc fields and also to optimize the indices which will decrease search latency. I will also reprocess 2017 and 2018 as there are many submissions that need their score / gilding data updated.)
(I noticed I have enough capacity to add replicas for submissions. Once all submissions have been restored, I will add a node just for submission replicas. This will add redundancy and more search capacity for all submissions. It should drastically improve uptime, so that all submissions stay available at all times, even if a node goes offline for whatever reason. It will also help load balance searches by distributing submission searches across the replicas, which should lower search latency and speed up searches and aggregations.)
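As a rough picture of why a replica helps with both uptime and load, here is a toy model in Python. It is only an illustration of the idea: the `ReplicaPool` class and the node names are made up, and this is not how the search backend actually routes requests.

```python
from itertools import cycle

class ReplicaPool:
    """Toy model: route each search to the next online copy of the data."""

    def __init__(self, nodes):
        self.online = {node: True for node in nodes}
        self._order = cycle(nodes)

    def mark_offline(self, node):
        self.online[node] = False

    def pick(self):
        # Try each node at most once per call, in round-robin order.
        for _ in range(len(self.online)):
            node = next(self._order)
            if self.online[node]:
                return node
        raise RuntimeError("no copy of the data is online")

pool = ReplicaPool(["primary", "replica"])
first_two = [pool.pick(), pool.pick()]       # searches alternate between copies
pool.mark_offline("primary")
after_failure = [pool.pick(), pool.pick()]   # every search now goes to the replica
print(first_two, after_failure)
```

With a single copy, losing the node would make those submissions unavailable; with two copies, searches are spread across both while they are healthy and keep working when one drops out.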
(Also note that this entire process will need to be repeated at some future date when the API is moved to v2.0. However, the process will be a bit more orderly than this one.) 😬