Hi all, I'm new to Python and web scraping, though I have a bit more experience with data analysis in Python. I'm scraping because I need data on the homes available for sale (and their prices) and the homes available for rent (and their prices) in about 115 different zip codes, which I then need to combine with some historical housing/rental data I already have to do some analysis. I've been looking at ways to scrape various websites and have run into trouble on most of them, having a hard time even finding where to start, but realtor dotcom seems a bit easier to find what I need on.
So, using ChatGPT and playing around in the browser's developer tools / inspect mode, I've come up with a script that does a great job scraping a single page from realtor dotcom. However, I need to get pagination working so that I can scrape all the pages for a single zip code at a time. (I'd also love to rework the script so it runs through a list of zip codes and gets it all done at once, but I can try that after this pagination piece is fixed.)
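For what it's worth, that eventual zip-code loop would mostly just be a wrapper around the single-zip scraper. A minimal sketch, where `scrape_zip()` is a hypothetical placeholder (the name, the dummy rows, and the two-zip list are all made up so it runs offline):

```python
# Hypothetical sketch of the zip-code loop. scrape_zip() stands in for
# the real single-zip scraper and just returns a dummy row per zip code.
def scrape_zip(zip_code):
    # real version: fetch and paginate the realtor.com results for this zip
    return [{"zip": zip_code, "price": "$250,000"}]

zip_codes = ["72762", "72701"]  # stand-in for the full list of ~115 codes

all_rows = []
for z in zip_codes:
    all_rows.extend(scrape_zip(z))  # accumulate every zip's rows in one list

print(len(all_rows))  # -> 2
```

The point is only the shape: one flat `all_rows` list that every zip appends to, which can then feed a single DataFrame at the end.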
So here's what I've got that works great for a single page:
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://www.realtor.com/apartments/72762"

# Set a user-agent header to avoid bot detection
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}

# Send an HTTP GET request to the URL and get the HTML response
response = requests.get(url, headers=headers)

# Parse the HTML response using BeautifulSoup
soup = BeautifulSoup(response.content, 'html.parser')

# Find the container for all the properties
container = soup.find('section', class_='PropertiesList_propertiesContainer__Vj44f PropertiesList_listViewGrid__OpnLO')

# Find all the listings in the container
listings = container.find_all('div', class_='BasePropertyCard_propertyCardWrap__pblQC')

# Loop through each listing and extract the details
data = []
for listing in listings:
    address1 = listing.find('div', {'class': 'truncate-line', 'data-testid': 'card-address-1'}).text.strip()
    address2 = listing.find('div', {'class': 'truncate-line', 'data-testid': 'card-address-2'}).text.strip()
    price = listing.find('div', class_='price-wrapper').text.strip()

    # Check that each element exists before reading its .text
    beds_elem = listing.find('li', {'data-testid': 'property-meta-beds'})
    beds = beds_elem.text.strip() if beds_elem else ''
    baths_elem = listing.find('li', {'data-testid': 'property-meta-baths'})
    baths = baths_elem.text.strip() if baths_elem else ''
    sqft_elem = listing.find('li', {'data-testid': 'property-meta-sqft'})
    sqft = sqft_elem.text.strip() if sqft_elem else ''

    # Append the extracted data to the list
    data.append([address1, address2, price, beds, baths, sqft])

# Create a pandas DataFrame from the extracted data
df = pd.DataFrame(data, columns=['Address1', 'Address2', 'Price', 'Beds', 'Baths', 'Sqft'])

# Print the dataframe
print(df)
Here's the element that leads to the next page on the website:
<div aria-label="pagination" role="navigation" class="Paginatorstyles__StyledPaginator-rui__sc-1prqz4y-0 gtbvTq Paginator_paginator__D07tn">
<a class="item btn disabled" tabindex="-1" aria-label="Go to previous page" href="">
<svg data-testid="icon-caret-left" viewBox="0 0 512 512" style="display:inline-block;width:1em;height:1em;font-size:24px;color:inherit;fill:currentColor" aria-hidden="true" focusable="false">
<path d="m227 260 82 92V160l-82 92c-2 2-2 6 0 8z">
</path>
</svg>
Previous
</a>
<a class="item btn current" aria-current="true" aria-label="Current page, page 1" tabindex="-1" href="/apartments/72762">1</a>
<a class="item btn " aria-current="false" aria-label="Go to page 2" tabindex="0" href="/apartments/72762/pg-2">2</a>
<a class="item btn " tabindex="0" aria-label="Go to next page" href="/apartments/72762/pg-2">
Next
<svg data-testid="icon-caret-right" viewBox="0 0 512 512" style="display:inline-block;width:1em;height:1em;font-size:24px;color:inherit;fill:currentColor" aria-hidden="true" focusable="false">
<path d="m306 260-82 92V160l82 92c2 2 2 6 0 8z"></path></svg></a></div>
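One thing that markup suggests: the "Next" link is most reliably identified by its aria-label rather than its class or its text (the anchor also contains an `<svg>`, and the obfuscated class names can change). In BeautifulSoup that would be something like `soup.find('a', attrs={'aria-label': 'Go to next page'})`. Here is a standalone sketch of the same idea using only the standard library, run against a trimmed copy of the snippet above so it works offline:

```python
from html.parser import HTMLParser

class NextLinkFinder(HTMLParser):
    """Collects the href of the <a> whose aria-label is 'Go to next page'."""
    def __init__(self):
        super().__init__()
        self.next_href = None

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        a = dict(attrs)
        # require a non-empty href: disabled paginator links have href=""
        if a.get("aria-label") == "Go to next page" and a.get("href"):
            self.next_href = a["href"]

def find_next_href(html):
    parser = NextLinkFinder()
    parser.feed(html)
    return parser.next_href  # None when there is no usable next link

# Trimmed copy of the paginator HTML pasted above
snippet = '''
<div aria-label="pagination" role="navigation">
  <a class="item btn disabled" tabindex="-1" aria-label="Go to previous page" href="">Previous</a>
  <a class="item btn current" aria-current="true" aria-label="Current page, page 1" href="/apartments/72762">1</a>
  <a class="item btn " aria-current="false" aria-label="Go to page 2" href="/apartments/72762/pg-2">2</a>
  <a class="item btn " tabindex="0" aria-label="Go to next page" href="/apartments/72762/pg-2">Next</a>
</div>
'''

print(find_next_href(snippet))  # -> /apartments/72762/pg-2
```

The non-empty-href check matters because the disabled "Previous" link above has `href=""`; presumably the "Next" link looks the same on the last page, which is what lets this double as the stop condition.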
So after giving that to ChatGPT, it suggested the following code (which is completely beyond my understanding), and it didn't work:
import requests
from bs4 import BeautifulSoup
import pandas as pd

url_template = "https://www.realtor.com/apartments/72762/pg-{}"

# Set a user-agent header to avoid bot detection
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}

data = []
page_num = 1
while True:
    # Construct the URL for the current page
    url = url_template.format(page_num)

    # Send an HTTP GET request to the URL and get the HTML response
    response = requests.get(url, headers=headers)

    # Parse the HTML response using BeautifulSoup
    soup = BeautifulSoup(response.content, 'html.parser')

    # Find the container for all the properties
    container = soup.find('section', class_='PropertiesList_propertiesContainer__Vj44f PropertiesList_listViewGrid__OpnLO')

    # Find all the listings in the container
    listings = container.find_all('div', class_='BasePropertyCard_propertyCardWrap__pblQC')

    # Loop through each listing and extract the details
    for listing in listings:
        address1 = listing.find('div', {'class': 'truncate-line', 'data-testid': 'card-address-1'}).text.strip()
        address2 = listing.find('div', {'class': 'truncate-line', 'data-testid': 'card-address-2'}).text.strip()
        price = listing.find('div', class_='price-wrapper').text.strip()

        # Check that each element exists before reading its .text
        beds_elem = listing.find('li', {'data-testid': 'property-meta-beds'})
        beds = beds_elem.text.strip() if beds_elem else ''
        baths_elem = listing.find('li', {'data-testid': 'property-meta-baths'})
        baths = baths_elem.text.strip() if baths_elem else ''
        sqft_elem = listing.find('li', {'data-testid': 'property-meta-sqft'})
        sqft = sqft_elem.text.strip() if sqft_elem else ''

        # Append the extracted data to the list
        data.append([address1, address2, price, beds, baths, sqft])

    # Find the "Next" button and check if it exists
    next_button = soup.find('a', class_='item btn', href=True, text='Next')
    if not next_button:
        break

    # Increment the page number and continue to the next page
    page_num = 1

# Create a pandas DataFrame from the extracted data
df = pd.DataFrame(data, columns=['Address1', 'Address2', 'Price', 'Beds', 'Baths', 'Sqft'])

# Print the dataframe
print(df)
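For reference, two things stand out in that suggestion. First, the bottom of the loop sets `page_num = 1` instead of incrementing it, so it re-fetches page 1 forever. Second, `find('a', ..., text='Next')` likely never matches, because the "Next" anchor also contains an `<svg>` and surrounding whitespace, so its text is not exactly the string `'Next'`. A runnable sketch of the corrected loop shape follows, with a stub `fetch()` and a regex standing in for `requests` and BeautifulSoup so it works offline (`FAKE_PAGES` is made-up data, and the sketch assumes `/pg-1` resolves like the bare listing URL):

```python
import re

# Stand-in for realtor.com responses: page 1 links to page 2, and
# page 2's "next" link is disabled (empty href), marking the last page.
FAKE_PAGES = {
    "https://www.realtor.com/apartments/72762/pg-1":
        '<a aria-label="Go to next page" href="/apartments/72762/pg-2">Next</a>',
    "https://www.realtor.com/apartments/72762/pg-2":
        '<a class="item btn disabled" aria-label="Go to next page" href="">Next</a>',
}

def fetch(url):
    # real version: requests.get(url, headers=headers).text
    return FAKE_PAGES[url]

def scrape_all_pages(url_template):
    visited = []
    page_num = 1
    while True:
        url = url_template.format(page_num)
        html = fetch(url)
        visited.append(url)
        # ... parse the listings out of `html` here, as in the single-page script ...
        # A next page exists only if the "Go to next page" link has a non-empty href
        m = re.search(r'aria-label="Go to next page" href="([^"]+)"', html)
        if not m:
            break          # no usable next link -> this was the last page
        page_num += 1      # the line the original got wrong (it reset to 1)
    return visited

pages = scrape_all_pages("https://www.realtor.com/apartments/72762/pg-{}")
print(len(pages))  # -> 2
```

In the real script the regex check would instead be a BeautifulSoup lookup such as `soup.find('a', attrs={'aria-label': 'Go to next page'})` followed by a test that its `href` is non-empty.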
So could you guys help me understand how to do pagination on a page like this, so I can get it to spit out all the info at once? Thanks for your help!
EDIT: I'm not looking to pay someone to do this work for me. If anyone can help me understand how to do this (even if it's just pointing me to resources that explain the concept well), that's what I'm asking for. So please stop DMing me offers of paid services, as I won't learn anything that way (and I didn't know this was a job-search subreddit anyway).
Post Details
- Posted 1 year ago
- External URL: reddit.com/r/webscraping...