Remove rows that are too much alike not to be duplicates

Post Body

I have a dataset of real estate advertisements. Several rows describe the same property, so the dataset is full of near-duplicates that are not exactly identical. What would be the best way to remove rows that are too similar not to be duplicates?

It looks like this:

        ID  URL CRAWL_SOURCE    PROPERTY_TYPE   NEW_BUILD   DESCRIPTION IMAGES  SURFACE LAND_SURFACE    BALCONY_SURFACE ... DEALER_NAME DEALER_TYPE CITY_ID CITY    ZIP_CODE    DEPT_CODE   PUBLICATION_START_DATE  PUBLICATION_END_DATE    LAST_CRAWL_DATE LAST_PRICE_DECREASE_DATE
    0   22c05930-0eb5-11e7-b53d-bbead8ba43fe    http://www.avendrealouer.fr/location/levallois...   A_VENDRE_A_LOUER    APARTMENT   False   Au rez de chaussée d'un bel immeuble récent,...   ["https://cf-medias.avendrealouer.fr/image/_87...   72.0    NaN NaN ... Lamirand Et Associes    AGENCY  54178039    Levallois-Perret    92300.0 92  2017-03-22T04:07:56.095 NaN 2017-04-21T18:52:35.733 NaN
    1   8d092fa0-bb99-11e8-a7c9-852783b5a69d    https://www.bienici.com/annonce/ag440414-16547...   BIEN_ICI    APARTMENT   False   Je vous propose un appartement dans la rue Col...   ["http://photos.ubiflow.net/440414/165474561/p...   48.0    NaN NaN ... Proprietes Privees  MANDATARY   54178039    Levallois-Perret    92300.0 92  2018-09-18T11:04:44.461 NaN 2019-06-06T10:08:10.89  2018-09-25

So far I have tried flagging rows whose descriptions match exactly:

    df['is_duplicated'] = df.duplicated(['DESCRIPTION'])
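Note that `duplicated` only flags rows whose DESCRIPTION is character-for-character identical, so an ad that was slightly reworded or re-punctuated slips through. A minimal sketch of a fuzzy comparison using the standard library's `difflib.SequenceMatcher` follows. The helper names `normalize` and `is_near_duplicate` and the 0.9 threshold are assumptions for illustration, not values tuned on this dataset.

```python
import re
import unicodedata
from difflib import SequenceMatcher

def normalize(text):
    # Lowercase, strip accents, and collapse punctuation/whitespace so that
    # cosmetic differences do not count against the similarity score.
    text = unicodedata.normalize("NFKD", text.lower())
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", " ", text).strip()

def is_near_duplicate(desc_a, desc_b, threshold=0.9):
    # ratio() is 1.0 for identical strings and falls toward 0.0 as they diverge.
    ratio = SequenceMatcher(None, normalize(desc_a), normalize(desc_b)).ratio()
    return ratio >= threshold
```

Since the DESCRIPTION column appears truncated in the printout above, the threshold would probably need checking against a hand-labelled sample of full descriptions before being trusted.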

I also tried comparing the arrays of photo URLs:

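    import ast
    from io import BytesIO

    import imagehash
    import pandas as pd
    import requests
    from PIL import Image

    def fetch_hash(url):
        # Download one image and return its average (perceptual) hash.
        response = requests.get(url, timeout=10)
        return imagehash.average_hash(Image.open(BytesIO(response.content)))

    def image_similarity(imageAurls, imageBurls, cutoff=5):
        # A missing image list (e.g. NaN on the last row) cannot match anything.
        if pd.isna(imageAurls) or pd.isna(imageBurls):
            return 'not similar'
        # IMAGES is stored as the string form of a list of URLs.
        imageAurls = ast.literal_eval(imageAurls)
        imageBurls = ast.literal_eval(imageBurls)
        # Hash the second listing's images once instead of once per pair.
        hashesB = [(urlB, fetch_hash(urlB)) for urlB in imageBurls]
        for urlA in imageAurls:
            hashA = fetch_hash(urlA)
            for urlB, hashB in hashesB:
                # Subtracting two hashes gives their Hamming distance.
                if hashA - hashB < cutoff:
                    print(urlA)
                    print(urlB)
                    return 'similar'
        # Only report 'not similar' after every pair has been checked.
        return 'not similar'

    # Pair each row with the image list of the following row.
    df['NextImage'] = df['IMAGES'].shift(-1)
    df['IsSimilar'] = df.apply(lambda x: image_similarity(x['IMAGES'], x['NextImage']), axis=1)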
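Comparing each row only with its neighbour misses duplicates that are not adjacent, while comparing every pair of rows would download far too many images. One common compromise is to compare only rows that agree on a few structured fields. The sketch below assumes that ZIP_CODE, PROPERTY_TYPE, and SURFACE are reliable enough to serve as a grouping key, which may not hold if different sites round the surface differently. `candidate_pairs` is a hypothetical helper, and the toy rows are made up.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(rows, keys=("ZIP_CODE", "PROPERTY_TYPE", "SURFACE")):
    # Group row IDs by the chosen fields; only rows in the same group are
    # plausible duplicates and worth the expensive text or image comparison.
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row["ID"])
    pairs = []
    for ids in groups.values():
        pairs.extend(combinations(ids, 2))
    return pairs

# Toy rows standing in for df.to_dict("records"); the values are made up.
rows = [
    {"ID": "a", "ZIP_CODE": 92300, "PROPERTY_TYPE": "APARTMENT", "SURFACE": 72.0},
    {"ID": "b", "ZIP_CODE": 92300, "PROPERTY_TYPE": "APARTMENT", "SURFACE": 72.0},
    {"ID": "c", "ZIP_CODE": 92300, "PROPERTY_TYPE": "APARTMENT", "SURFACE": 48.0},
]
print(candidate_pairs(rows))  # [('a', 'b')]
```

Each returned pair can then be passed to the description or image comparison, so the slow checks run only on plausible matches. Rows with a missing value in any key field would need separate handling, since NaN values do not reliably compare equal to each other.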

Author

MikeREDDITR

Subreddit

r/datacleaning

Post Details

Posted: 5 years ago