[Help Request] Missing elements on a page
HasuTeras is in HELP Request

Hi all,

I'm pretty new to web scraping generally. Also, I don't really have any background in web dev or anything like that (I'm an economist), so if I misname anything or don't refer to things correctly, I apologise. I've been trying to do some data collection for my PhD work and am running into an issue that I can't seem to get my head around. I've consulted Stack Overflow and ChatGPT but can't seem to make any headway.

The crux of the issue is: I'm trying to scrape a website where people offer services. It contains their user profiles and a host of related information (location, price per hour of service, user ratings, etc.). For reference, I'm doing it with RSelenium, as the page is dynamically generated and scraping involves cycling through different pages on the website and selecting/clicking on bits of the page. My background is in R, because I mainly do statistics/econometrics work, though I can limp along in Python as well.

The issue I'm running into is:

  • The page can load a max. of 1,000 query results, 20 at a time. Great. I write a script that takes the number n of returned results and triggers the dynamic scroll the appropriate number of times to load all of them.
  • I write up a couple of lines that find the elements for userID. I extract them. I get a vector of N userIDs.
  • I then write up a couple of lines that find the elements for the price per hour. I extract them. I get a vector of prices.

However... not all users provide a price per hour. So I end up with a vector of 1,000 user names but a vector of only 995 prices, and I have no idea where they become misaligned. When I manually check the profiles that don't show a price, they simply do not contain the HTML class that relates to prices (if the element were there but empty, I'd just write something to impute an NA).
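
The misalignment can be reproduced on a toy page. This is only a minimal Python sketch, not the actual site or the RSelenium script: the markup and the class names `profile`, `user-id` and `price` are made up for illustration, and the standard library's `xml.etree.ElementTree` stands in for the browser.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the second profile has no price element at all.
PAGE = """
<div id="results">
  <div class="profile"><span class="user-id">u1</span><span class="price">20</span></div>
  <div class="profile"><span class="user-id">u2</span></div>
  <div class="profile"><span class="user-id">u3</span><span class="price">35</span></div>
</div>
"""

root = ET.fromstring(PAGE.strip())

# Two independent page-wide queries, one per field.
user_ids = [e.text for e in root.findall(".//span[@class='user-id']")]
prices = [e.text for e in root.findall(".//span[@class='price']")]

print(user_ids)  # ['u1', 'u2', 'u3']
print(prices)    # ['20', '35']

# Pairing by position silently gives u2 the price that belongs to u3.
print(list(zip(user_ids, prices)))  # [('u1', '20'), ('u2', '35')]
```

Because each query only returns the elements that exist, nothing in the shorter list records which profile was skipped.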

My next approach was to loop over the profiles themselves. I run a findElement call for the profiles and assign the result to an object; then, for every profile in profiles, I do a findElement for the price, check how many elements with the HTML class 'price' come back, and call getElementText if there is at least one, or record NA otherwise. The problem is that, for some reason, I get ALL the text from the profile, not just the price. This sucks for two reasons: a) it is inefficient and runs very slowly (and I want to scrape the entire site, so this wouldn't be feasible), and b) it then requires a lot of data processing at the end to extract the price from the wall of text.
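
For concreteness, here is a rough Python sketch of the per-profile idea on a toy page. Everything in it is an assumption for illustration: the markup, the class names `profile`, `user-id` and `price`, and the helper `extract_rows` are hypothetical, and `xml.etree.ElementTree` stands in for the browser. It parses one copy of the page source (in RSelenium that would presumably come from `getPageSource()`) and looks for the price only inside each profile, so a missing price becomes `None` instead of shifting the later rows. Real pages are rarely well-formed XML, so a proper HTML parser would be needed in practice.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the second profile has no price element at all.
PAGE = """
<div id="results">
  <div class="profile"><span class="user-id">u1</span><span class="price">20</span></div>
  <div class="profile"><span class="user-id">u2</span></div>
  <div class="profile"><span class="user-id">u3</span><span class="price">35</span></div>
</div>
"""


def extract_rows(page_source):
    """Return one (user_id, price) pair per profile; price is None when absent."""
    root = ET.fromstring(page_source.strip())
    rows = []
    for profile in root.findall(".//div[@class='profile']"):
        # The leading "./" keeps both lookups inside this profile only.
        user_id = profile.findtext("./span[@class='user-id']")
        price = profile.findtext("./span[@class='price']")  # None if missing
        rows.append((user_id, price))
    return rows


rows = extract_rows(PAGE)
print(rows)  # [('u1', '20'), ('u2', None), ('u3', '35')]
```

Working from a single parsed copy of the page avoids one browser round trip per element, which may help with the speed problem. If the lookup is done on live WebDriver elements instead, note that in Selenium an XPath starting with `//` searches the whole document even when called from a child element, while `.//` stays relative to that element.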

Does anyone know of any workarounds for this kind of thing?

Much obliged!

Posted
10 months ago