So... a bit out of my depth here, but I've downloaded the ~50GB XML data dump of the ENTIRE English version of Wikipedia and I'm looking to pull particular pages from it (specifically ALL the video game, film, music and book articles), extract certain text from those pages (specifically the infobox sections), and then process that raw text into a comprehensive multimedia database. Ambitious, but that's my aim.
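To make the infobox part concrete, here's roughly the step I'm picturing once I have a page's wikitext: a minimal sketch assuming the mwparserfromhell library (extract_infobox is just a name I made up for illustration).

```python
# Minimal sketch, assuming mwparserfromhell (pip install mwparserfromhell).
# extract_infobox() is a hypothetical helper, not an established API.
import mwparserfromhell

def extract_infobox(wikitext):
    """Return the first infobox template on a page as a plain dict, or None."""
    wikicode = mwparserfromhell.parse(wikitext)
    for template in wikicode.filter_templates():
        name = str(template.name).strip().lower()
        if name.startswith("infobox"):
            # Map each template parameter (e.g. |developer=, |released=) to its value.
            return {
                str(p.name).strip(): str(p.value).strip()
                for p in template.params
            }
    return None

# Example: a trimmed snippet of wikitext like you'd find in a video game article.
sample = "{{Infobox video game\n| title = Example Quest\n| developer = Example Studio\n}}"
print(extract_infobox(sample))
# -> {'title': 'Example Quest', 'developer': 'Example Studio'}
```

The idea being that an infobox is just a wikitext template, so its parameters can be read off as key/value pairs once the page text is in hand.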
Obviously the dump file is SO HUGE that you can't open it through normal means (e.g. in Excel), and since there doesn't seem to be one concrete, agreed-upon method for processing it, I was wondering which approach Reddit would suggest as most suitable for what I'm trying to do?
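For the size problem specifically, I gather the usual answer is a streaming parser that never loads the whole file. Here's a minimal sketch using only Python's standard library, assuming the standard pages-articles dump layout (iter_pages is a name I made up):

```python
# Minimal sketch: stream the dump page by page so memory use stays flat.
import bz2
import xml.etree.ElementTree as ET

def iter_pages(path):
    """Yield (title, wikitext) for every ordinary article in the dump."""
    # Handle a .bz2 dump transparently; a plain .xml file works with open().
    opener = bz2.open if path.endswith(".bz2") else open
    with opener(path, "rb") as f:
        context = ET.iterparse(f, events=("start", "end"))
        _, root = next(context)                      # the top-level <mediawiki> element
        for event, elem in context:
            if event != "end" or elem.tag.rsplit("}", 1)[-1] != "page":
                continue
            title = ns = text = None
            # Match children by local tag name so the exact export namespace
            # version in the dump doesn't matter.
            for child in elem.iter():
                local = child.tag.rsplit("}", 1)[-1]
                if local == "title":
                    title = child.text
                elif local == "ns":
                    ns = child.text
                elif local == "text":
                    text = child.text
            if ns == "0" and title and text:         # namespace 0 = ordinary articles
                yield title, text
            root.clear()                             # drop finished pages from memory

# for title, wikitext in iter_pages("enwiki-latest-pages-articles.xml.bz2"):
#     ...  # hand each page to the infobox step above
```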
Most of the methods out there seem to be various programming scripts rather than ready-made applications for parsing the dump, and a lot of them are mentioned on MediaWiki. I'm familiar with SQL, Python, JavaScript and HTML, and I don't really mind what I use as long as it gets the job done. I just don't know where to start or what steps to do first.
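Since I already know some SQL, my rough assumption for the database end is SQLite through Python's built-in sqlite3 module. A minimal sketch with a deliberately simple made-up schema (one row per article, infobox fields stored as JSON):

```python
# Minimal sketch of the "put it in a database" step, assuming SQLite.
# The schema is just one simple starting point, not the only sensible design.
import json
import sqlite3

conn = sqlite3.connect("wikipedia_media.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS articles (
        title        TEXT PRIMARY KEY,
        infobox_type TEXT,          -- e.g. 'Infobox video game', 'Infobox film'
        infobox_json TEXT           -- all infobox fields, serialized as JSON
    )
""")

def save_article(title, infobox_type, fields):
    """Insert or update one parsed article."""
    conn.execute(
        "INSERT OR REPLACE INTO articles (title, infobox_type, infobox_json) VALUES (?, ?, ?)",
        (title, infobox_type, json.dumps(fields)),
    )

# save_article("Example Quest", "Infobox video game",
#              {"developer": "Example Studio", "released": "1999"})
# conn.commit()
```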
I chose to download the dump file and parse it rather than web-scraping Wikipedia because I wasn't sure whether scraping would be too burdensome on Wikipedia's servers and didn't want to risk getting my IP banned. But if you know of any other way to get this information besides what I've mentioned, please let me know!
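For completeness, if fetching pages directly turns out to be acceptable, I assume the polite version goes through the public api.php endpoint rather than scraping rendered HTML, with a descriptive User-Agent and a delay between requests. A minimal sketch assuming the requests library (the User-Agent string and the one-second delay are placeholders I made up):

```python
# Minimal sketch: fetch one page's wikitext via the MediaWiki API.
import time
import requests

API_URL = "https://en.wikipedia.org/w/api.php"
HEADERS = {"User-Agent": "MediaInfoboxBot/0.1 (contact: you@example.com)"}

def fetch_wikitext(title):
    """Return the current wikitext of one article via the public API."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",
        "titles": title,
        "format": "json",
        "formatversion": "2",
    }
    resp = requests.get(API_URL, params=params, headers=HEADERS, timeout=30)
    resp.raise_for_status()
    page = resp.json()["query"]["pages"][0]
    return page["revisions"][0]["slots"]["main"]["content"]

# for title in ["The Legend of Zelda", "Blade Runner"]:
#     wikitext = fetch_wikitext(title)
#     time.sleep(1)   # space out requests to stay well under rate limits
```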