So... a bit out of my depth here, but I've downloaded the ~50GB XML data dump of the ENTIRE English version of Wikipedia and I'm looking to pull particular pages from it (specifically ALL the video game, film, music and book articles), extract certain text from those pages (specifically the infobox sections), and then process that raw text into a comprehensive multimedia database. Ambitious, but that's my aim.
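To make the infobox part concrete, here's roughly the step I'm picturing once I have a page's wikitext: a minimal sketch assuming the mwparserfromhell library (extract_infobox is just a name I made up for illustration).

```python
# Minimal sketch, assuming mwparserfromhell (pip install mwparserfromhell).
# extract_infobox() is a hypothetical helper, not an established API.
import mwparserfromhell

def extract_infobox(wikitext):
    """Return the first infobox template on a page as a plain dict, or None."""
    wikicode = mwparserfromhell.parse(wikitext)
    for template in wikicode.filter_templates():
        name = str(template.name).strip().lower()
        if name.startswith("infobox"):
            # Map each template parameter (e.g. |developer=, |released=) to its value.
            return {
                str(p.name).strip(): str(p.value).strip()
                for p in template.params
            }
    return None

# Example: a trimmed snippet of wikitext like you'd find in a video game article.
sample = "{{Infobox video game\n| title = Example Quest\n| developer = Example Studio\n}}"
print(extract_infobox(sample))
# -> {'title': 'Example Quest', 'developer': 'Example Studio'}
```

The idea being that an infobox is just a wikitext template, so its parameters can be read off as key/value pairs once the page text is in hand.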
Obviously the dump file is SO HUGE that you can't open it through normal means (e.g. in Excel), and since there doesn't seem to be one concrete, agreed-upon method for processing it, I was wondering which approach Reddit would suggest as most suitable for what I'm trying to do?
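For the size problem specifically, I gather the usual answer is a streaming parser that never loads the whole file. Here's a minimal sketch using only Python's standard library, assuming the standard pages-articles dump layout (iter_pages is a name I made up):

```python
# Minimal sketch: stream the dump page by page so memory use stays flat.
import bz2
import xml.etree.ElementTree as ET

def iter_pages(path):
    """Yield (title, wikitext) for every ordinary article in the dump."""
    # Handle a .bz2 dump transparently; a plain .xml file works with open().
    opener = bz2.open if path.endswith(".bz2") else open
    with opener(path, "rb") as f:
        context = ET.iterparse(f, events=("start", "end"))
        _, root = next(context)                      # the top-level <mediawiki> element
        for event, elem in context:
            if event != "end" or elem.tag.rsplit("}", 1)[-1] != "page":
                continue
            title = ns = text = None
            # Match children by local tag name so the exact export namespace
            # version in the dump doesn't matter.
            for child in elem.iter():
                local = child.tag.rsplit("}", 1)[-1]
                if local == "title":
                    title = child.text
                elif local == "ns":
                    ns = child.text
                elif local == "text":
                    text = child.text
            if ns == "0" and title and text:         # namespace 0 = ordinary articles
                yield title, text
            root.clear()                             # drop finished pages from memory

# for title, wikitext in iter_pages("enwiki-latest-pages-articles.xml.bz2"):
#     ...  # hand each page to the infobox step above
```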
Most of the methods out there seem to be various programming scripts rather than ready-made applications for parsing the dump, and a lot of them are mentioned on MediaWiki. I'm familiar with SQL, Python, JavaScript and HTML, and I don't really mind what I use as long as it gets the job done. I just don't know where to start or what steps to do first.
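Since I already know some SQL, my rough assumption for the database end is SQLite through Python's built-in sqlite3 module. A minimal sketch with a deliberately simple made-up schema (one row per article, infobox fields stored as JSON):

```python
# Minimal sketch of the "put it in a database" step, assuming SQLite.
# The schema is just one simple starting point, not the only sensible design.
import json
import sqlite3

conn = sqlite3.connect("wikipedia_media.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS articles (
        title        TEXT PRIMARY KEY,
        infobox_type TEXT,          -- e.g. 'Infobox video game', 'Infobox film'
        infobox_json TEXT           -- all infobox fields, serialized as JSON
    )
""")

def save_article(title, infobox_type, fields):
    """Insert or update one parsed article."""
    conn.execute(
        "INSERT OR REPLACE INTO articles (title, infobox_type, infobox_json) VALUES (?, ?, ?)",
        (title, infobox_type, json.dumps(fields)),
    )

# save_article("Example Quest", "Infobox video game",
#              {"developer": "Example Studio", "released": "1999"})
# conn.commit()
```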
I chose to download the dump file and parse it rather than web-scraping Wikipedia because I wasn't sure whether scraping would be too burdensome on Wikipedia's servers and didn't want to risk getting my IP banned. But if you know of any other way to get this information besides what I've mentioned, please let me know!
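For completeness, if fetching pages directly turns out to be acceptable, I assume the polite version goes through the public api.php endpoint rather than scraping rendered HTML, with a descriptive User-Agent and a delay between requests. A minimal sketch assuming the requests library (the User-Agent string and the one-second delay are placeholders I made up):

```python
# Minimal sketch: fetch one page's wikitext via the MediaWiki API.
import time
import requests

API_URL = "https://en.wikipedia.org/w/api.php"
HEADERS = {"User-Agent": "MediaInfoboxBot/0.1 (contact: you@example.com)"}

def fetch_wikitext(title):
    """Return the current wikitext of one article via the public API."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",
        "titles": title,
        "format": "json",
        "formatversion": "2",
    }
    resp = requests.get(API_URL, params=params, headers=HEADERS, timeout=30)
    resp.raise_for_status()
    page = resp.json()["query"]["pages"][0]
    return page["revisions"][0]["slots"]["main"]["content"]

# for title in ["The Legend of Zelda", "Blade Runner"]:
#     wikitext = fetch_wikitext(title)
#     time.sleep(1)   # space out requests to stay well under rate limits
```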