Hi all,
I'm pretty new to web scraping generally. I also don't really have any background in web dev or anything like that (I'm an economist), so if I misname anything or don't refer to things correctly, I apologise. I've been trying to do some data collection for my PhD work and am running into an issue that I can't seem to get my head around. I've consulted Stack Overflow and ChatGPT but can't seem to make any headway.
The crux of the issue: I'm trying to scrape a website where people offer services, which contains their user profiles and a host of information related to them (location, price per hour of service, user ratings, etc.). For reference, I'm doing it in RSelenium, as the page is dynamically generated and scraping involves cycling through different pages on the website and selecting/clicking on parts of the page. My background is in R, because I mainly do statistics/econometrics work, though I can limp along in Python as well.
The issue I'm running into is:
- The page can load a maximum of 1,000 query results, 20 at a time. Great. I write a script that takes the number of returned results and triggers the dynamic scroll the appropriate number of times to load them all.
- I write up a couple of lines that find the elements for userID. I extract them. I get a vector of N userIDs.
- I then write up a couple of lines that find the elements for the price per hour. I extract them. I get a vector of prices.
However... not all users provide a price per hour. So now I end up with a vector of 1,000 user IDs but only 995 prices, and I have no idea where they become misaligned. When I manually check the profiles that don't show a price, they simply don't contain the HTML class that relates to prices at all (if the class were present but empty, I'd just write something to impute an NA). A simplified version of this first approach is below.
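For context, here's a stripped-down version of that first approach. The URL and CSS class names (".user-id", ".price") are placeholders rather than the site's real ones, and I've simplified the scroll logic, but the structure is the same:

```r
library(RSelenium)

# placeholder setup -- the real script points at the actual site
driver <- rsDriver(browser = "firefox", chromever = NULL)
remDr  <- driver$client
remDr$navigate("https://example.com/search")

# trigger the dynamic scroll enough times to load all results (20 per scroll)
n_results <- 1000
for (i in seq_len(ceiling(n_results / 20))) {
  remDr$executeScript("window.scrollTo(0, document.body.scrollHeight);")
  Sys.sleep(1)  # give the next batch time to load
}

# extract userIDs and prices as two separate, independent vectors
id_elems    <- remDr$findElements(using = "css selector", value = ".user-id")
price_elems <- remDr$findElements(using = "css selector", value = ".price")

user_ids <- vapply(id_elems,    function(e) e$getElementText()[[1]], character(1))
prices   <- vapply(price_elems, function(e) e$getElementText()[[1]], character(1))

length(user_ids)  # 1000
length(prices)    # 995 -- profiles with no price simply have no price node to pick up
```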
My next approach was to try to for-loop over the profiles themselves: I run a findElements call for the profiles, assign the result to an object, and then, for every profile, I do a findElement for the price and check the length of the result for the HTML class 'price'; if it's > 0 I getElementText, and if not I record NA (a sketch of what I mean is below). The problem is that, for some reason, I get ALL the text from the profile, not just the price. This sucks for two reasons: a) it's inefficient and runs very slowly (and I want to scrape the entire site, so this wouldn't be feasible), and b) it then requires a lot of data processing at the end to extract the specific price information from the wall of text it has pulled out.
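And here's roughly what I mean by the second approach, again with placeholder class names. The idea (as I understand it) is to scope the price lookup to each profile element via findChildElements, so a profile with no price node just yields an NA instead of shifting the whole vector; my actual loop clearly isn't behaving like this, so my selectors may be the culprit:

```r
# ".profile-card" and ".price" are placeholder class names
profile_elems <- remDr$findElements(using = "css selector", value = ".profile-card")

prices <- vapply(profile_elems, function(card) {
  # findChildElements searches only inside this profile element, not the whole page
  price_node <- card$findChildElements(using = "css selector", value = ".price")
  if (length(price_node) > 0) price_node[[1]]$getElementText()[[1]] else NA_character_
}, character(1))
```

Even when this does return only the price, it's a couple of Selenium round trips per profile, which is presumably what makes it so slow.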
Does anyone know of any workarounds for this kind of thing?
Much obliged!