Hi all,
I'm pretty new to web scraping generally. I also don't really have any background in web dev or anything like that (I'm an economist), so if I misname anything or don't refer to things correctly, I apologise. I've been trying to do some data collection for my PhD work and am running into an issue that I can't seem to get my head around. I've consulted Stack Overflow and ChatGPT but can't seem to make any headway.
The crux of the issue: I'm trying to scrape a website where people offer services, which contains their user profiles and a host of information related to them (location, price per hour of service, user ratings, etc.). For reference, I'm doing it in RSelenium, as the page is dynamically generated and scraping involves cycling through different pages on the website and selecting/clicking on parts of the page. My background is in R, because I mainly do statistics/econometrics work, though I can limp along in Python as well.
The issue I'm running into is:
- The page can load a maximum of 1,000 query results, 20 at a time. Great. I write a script that takes the number of returned results and triggers the dynamic scroll the appropriate number of times to load them all.
- I write up a couple of lines that find the elements for userID. I extract them. I get a vector of N userIDs.
- I then write up a couple of lines that find the elements for the price per hour. I extract them. I get a vector of prices.
However... not all users provide a price per hour. So now I end up with a vector of 1,000 user IDs but only 995 prices, and I have no idea where they become misaligned. When I manually check the profiles that don't show a price, they simply don't contain the HTML class that relates to prices at all (if the class were present but empty, I'd just write something to impute an NA). A simplified version of this first approach is below.
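For context, here's a stripped-down version of that first approach. The URL and CSS class names (".user-id", ".price") are placeholders rather than the site's real ones, and I've simplified the scroll logic, but the structure is the same:

```r
library(RSelenium)

# placeholder setup -- the real script points at the actual site
driver <- rsDriver(browser = "firefox", chromever = NULL)
remDr  <- driver$client
remDr$navigate("https://example.com/search")

# trigger the dynamic scroll enough times to load all results (20 per scroll)
n_results <- 1000
for (i in seq_len(ceiling(n_results / 20))) {
  remDr$executeScript("window.scrollTo(0, document.body.scrollHeight);")
  Sys.sleep(1)  # give the next batch time to load
}

# extract userIDs and prices as two separate, independent vectors
id_elems    <- remDr$findElements(using = "css selector", value = ".user-id")
price_elems <- remDr$findElements(using = "css selector", value = ".price")

user_ids <- vapply(id_elems,    function(e) e$getElementText()[[1]], character(1))
prices   <- vapply(price_elems, function(e) e$getElementText()[[1]], character(1))

length(user_ids)  # 1000
length(prices)    # 995 -- profiles with no price simply have no price node to pick up
```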
My next approach was to try to for-loop over the profiles themselves: I run a findElements call for the profiles, assign the result to an object, and then, for every profile, I do a findElement for the price and check the length of the result for the HTML class 'price'; if it's > 0 I getElementText, and if not I record NA (a sketch of what I mean is below). The problem is that, for some reason, I get ALL the text from the profile, not just the price. This sucks for two reasons: a) it's inefficient and runs very slowly (and I want to scrape the entire site, so this wouldn't be feasible), and b) it then requires a lot of data processing at the end to extract the specific price information from the wall of text it has pulled out.
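And here's roughly what I mean by the second approach, again with placeholder class names. The idea (as I understand it) is to scope the price lookup to each profile element via findChildElements, so a profile with no price node just yields an NA instead of shifting the whole vector; my actual loop clearly isn't behaving like this, so my selectors may be the culprit:

```r
# ".profile-card" and ".price" are placeholder class names
profile_elems <- remDr$findElements(using = "css selector", value = ".profile-card")

prices <- vapply(profile_elems, function(card) {
  # findChildElements searches only inside this profile element, not the whole page
  price_node <- card$findChildElements(using = "css selector", value = ".price")
  if (length(price_node) > 0) price_node[[1]]$getElementText()[[1]] else NA_character_
}, character(1))
```

Even when this does return only the price, it's a couple of Selenium round trips per profile, which is presumably what makes it so slow.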
Does anyone know of any workarounds for this kind of thing?
Much obliged!