This post has been de-listed
It is no longer included in search results and normal feeds (front page, hot posts, subreddit posts, etc). It remains visible only via the author's post history.
3
Remove rows that are too much alike not to be duplicates
Post Body
I have a dataset of real estate advertisements. Several of the lines are about the same real estate property so it's full of duplicates that aren't exactly the same. What would be the best methods to remove rows that are too much alike not to be duplicates?
It looks like this :
ID URL CRAWL_SOURCE PROPERTY_TYPE NEW_BUILD DESCRIPTION IMAGES SURFACE LAND_SURFACE BALCONY_SURFACE ... DEALER_NAME DEALER_TYPE CITY_ID CITY ZIP_CODE DEPT_CODE PUBLICATION_START_DATE PUBLICATION_END_DATE LAST_CRAWL_DATE LAST_PRICE_DECREASE_DATE
0 22c05930-0eb5-11e7-b53d-bbead8ba43fe http://www.avendrealouer.fr/location/levallois... A_VENDRE_A_LOUER APARTMENT False Au rez de chaussée d'un bel immeuble récent,... ["https://cf-medias.avendrealouer.fr/image/_87... 72.0 NaN NaN ... Lamirand Et Associes AGENCY 54178039 Levallois-Perret 92300.0 92 2017-03-22T04:07:56.095 NaN 2017-04-21T18:52:35.733 NaN
1 8d092fa0-bb99-11e8-a7c9-852783b5a69d https://www.bienici.com/annonce/ag440414-16547... BIEN_ICI APARTMENT False Je vous propose un appartement dans la rue Col... ["http://photos.ubiflow.net/440414/165474561/p... 48.0 NaN NaN ... Proprietes Privees MANDATARY 54178039 Levallois-Perret 92300.0 92 2018-09-18T11:04:44.461 NaN 2019-06-06T10:08:10.89 2018-09-25
So far I tried to compare the description :
df['is_duplicated'] = df.duplicated(['DESCRIPTION'])
And to compare the array of photos :
def image_similarity(imageAurls,imageBurls):
imageAurls = ast.literal_eval(imageAurls)
imageBurls = ast.literal_eval(imageBurls)
for urlA in imageAurls:
responseA = requests.get(urlA)
imgA = Image.open(BytesIO(responseA.content))
print(imgA)
for urlB in imageBurls:
responseB = requests.get(urlB)
imgB = Image.open(BytesIO(responseB.content))
hash0 = imagehash.average_hash(imgA)
hash1 = imagehash.average_hash(imgB)
cutoff = 5
if hash0 - hash1 < cutoff:
print(urlA)
print(urlB)
return('similar')
return('not similar')
df['NextImage'] = df['IMAGES'][df['IMAGES'].index - 1]
df['IsSimilar'] = df.apply(lambda x: image_similarity(x['IMAGES'], x['NextImage']), axis=1)
Author
Account Strength
80%
Account Age
6 years
Verified Email
Yes
Verified Flair
No
Total Karma
328
Link Karma
211
Comment Karma
71
Profile updated: 4 days ago
Posts updated: 4 months ago
Subreddit
Post Details
We try to extract some basic information from the post title. This is not
always successful or accurate, please use your best judgement and compare
these values to the post title and body for confirmation.
- Posted
- 5 years ago
- Reddit URL
- View post on reddit.com
- External URL
- reddit.com/r/datacleanin...