How would you suggest I parse a Wikipedia data dump?
So, I'm a bit out of my depth here, but I have downloaded a ~50GB XML data dump of the ENTIRE English version of Wikipedia. I'm looking to pull particular pages from it (specifically ALL the video game, film, music and book articles), extract certain text from them (specifically the infobox sections), and then process this raw text and put it all into a comprehensive multimedia database (or databases)! Ambitious, but that's my aim.
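To make the infobox step concrete, here is a rough sketch of the kind of extraction I have in mind. It assumes the infobox appears in the raw wikitext as a template starting with `{{Infobox` and that braces are balanced; `extract_infobox` is just a name I made up for illustration, not an existing tool.

```python
import re


def extract_infobox(wikitext):
    """Return the first {{Infobox ...}} template found in wikitext, or None.

    Assumption: the template opens with '{{Infobox' (any capitalisation)
    and nested templates inside it use balanced '{{' / '}}' pairs.
    """
    match = re.search(r"\{\{\s*infobox", wikitext, flags=re.IGNORECASE)
    if match is None:
        return None

    start = match.start()
    depth = 0
    i = start
    # Walk forward, tracking how deeply nested we are in {{ ... }} pairs.
    while i < len(wikitext) - 1:
        pair = wikitext[i:i + 2]
        if pair == "{{":
            depth += 1
            i += 2
        elif pair == "}}":
            depth -= 1
            i += 2
            if depth == 0:
                # Closed the template we started in: this is the infobox.
                return wikitext[start:i]
        else:
            i += 1
    return None  # braces never balanced, so give up on this page
```

A regex alone would stop at the first `}}` of a nested template, which is why the sketch counts brace depth instead.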

Obviously the dump file is SO HUGE that you cannot open it through normal means (i.e. in Excel), and since there appears to be no single standard method for processing it, I was wondering which of the available methods Reddit would suggest as most suitable for what I aim to do?
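From what I can tell, the usual way around the size problem is to stream the file instead of loading it whole. Below is a minimal sketch using `iterparse` from Python's standard library. It assumes the dump follows the usual MediaWiki export layout (`<page>` elements containing a `<title>` and a `<revision>` with a `<text>` child); `iter_pages` and `local_name` are hypothetical helper names of my own.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, wikitext) pairs one page at a time.

    `source` is a filename or a binary file object. Assumption: each
    <page> holds one <title> and one <revision>/<text>, as in the
    current-revisions article dumps.
    """
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # first event is the start of the root element

    title, text = None, None
    for event, elem in context:
        if event != "end":
            continue
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, None
            # Drop everything parsed so far so memory use stays flat.
            root.clear()
```

Since `iterparse` also accepts an open file object, the same function should work on a compressed dump opened with `bz2.open(path, "rb")`, so the file would not need to be fully unpacked first.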

Most of the methods out there appear to be programming scripts rather than applications for parsing the data dump, and a lot of them are mentioned on MediaWiki. I'm familiar with the workings of SQL, Python, JavaScript and HTML and don't really mind what I use as long as it gets the job done! I just don't know where to start or which steps I have to do first.

I chose to download the dump file and parse it instead of just web scraping Wikipedia because I wasn't sure whether scraping would be too burdensome on Wikipedia's servers, and I didn't want to risk getting my IP banned. But if you know any other ways to get this information besides the ones I have mentioned, please let me know!

Posted
7 years ago