Remove rows that are too much alike not to be duplicates

I have a dataset of real estate advertisements. Several rows describe the same property, so the dataset is full of near-duplicates that are not exactly identical. What would be the best methods for removing rows that are too similar not to be duplicates?

It looks like this:

        ID  URL CRAWL_SOURCE    PROPERTY_TYPE   NEW_BUILD   DESCRIPTION IMAGES  SURFACE LAND_SURFACE    BALCONY_SURFACE ... DEALER_NAME DEALER_TYPE CITY_ID CITY    ZIP_CODE    DEPT_CODE   PUBLICATION_START_DATE  PUBLICATION_END_DATE    LAST_CRAWL_DATE LAST_PRICE_DECREASE_DATE
    0   22c05930-0eb5-11e7-b53d-bbead8ba43fe    http://www.avendrealouer.fr/location/levallois...   A_VENDRE_A_LOUER    APARTMENT   False   Au rez de chaussée d'un bel immeuble récent,...   ["https://cf-medias.avendrealouer.fr/image/_87...   72.0    NaN NaN ... Lamirand Et Associes    AGENCY  54178039    Levallois-Perret    92300.0 92  2017-03-22T04:07:56.095 NaN 2017-04-21T18:52:35.733 NaN
    1   8d092fa0-bb99-11e8-a7c9-852783b5a69d    https://www.bienici.com/annonce/ag440414-16547...   BIEN_ICI    APARTMENT   False   Je vous propose un appartement dans la rue Col...   ["http://photos.ubiflow.net/440414/165474561/p...   48.0    NaN NaN ... Proprietes Privees  MANDATARY   54178039    Levallois-Perret    92300.0 92  2018-09-18T11:04:44.461 NaN 2019-06-06T10:08:10.89  2018-09-25

So far I have tried comparing the descriptions, which only flags exact matches:

    df['is_duplicated'] = df.duplicated(['DESCRIPTION'])
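
Since `duplicated` only catches descriptions that match character for character, a fuzzy comparison may get closer. Here is a minimal sketch using only the standard library, not a definitive method: it assumes that rows sharing the same `ZIP_CODE` and `SURFACE` are the only candidates worth comparing, and the `normalize` and `near_duplicate_pairs` helpers and the 0.9 threshold are hypothetical choices that would need tuning on the real data.

```python
import re
from difflib import SequenceMatcher
from itertools import combinations


def normalize(text):
    """Lowercase the text and collapse punctuation and whitespace into single spaces."""
    return re.sub(r"\W+", " ", str(text).lower()).strip()


def near_duplicate_pairs(rows, threshold=0.9):
    """Return (id_a, id_b, ratio) for rows in the same block with similar descriptions.

    rows: iterable of dicts with ID, ZIP_CODE, SURFACE and DESCRIPTION keys.
    threshold: assumed cutoff on the SequenceMatcher ratio (0.0 to 1.0).
    """
    # Blocking: only compare rows that share ZIP_CODE and SURFACE (an assumption).
    blocks = {}
    for row in rows:
        key = (row["ZIP_CODE"], row["SURFACE"])
        blocks.setdefault(key, []).append(row)

    pairs = []
    for block in blocks.values():
        for a, b in combinations(block, 2):
            ratio = SequenceMatcher(
                None, normalize(a["DESCRIPTION"]), normalize(b["DESCRIPTION"])
            ).ratio()
            if ratio >= threshold:
                pairs.append((a["ID"], b["ID"], ratio))
    return pairs
```

With the DataFrame above, this could be called as `near_duplicate_pairs(df.to_dict('records'))`. Blocking keeps the number of pairwise comparisons manageable, at the cost of missing duplicates whose surface or ZIP code was entered differently.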

I have also tried comparing the arrays of photos:

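    import ast
    from io import BytesIO

    import imagehash
    import pandas as pd
    import requests
    from PIL import Image

    def image_similarity(imageAurls, imageBurls, cutoff=5):
        # A missing value (e.g. the last row has no next row) cannot match
        if pd.isna(imageAurls) or pd.isna(imageBurls):
            return 'not similar'
        # IMAGES holds the string representation of a list of URLs
        imageAurls = ast.literal_eval(imageAurls)
        imageBurls = ast.literal_eval(imageBurls)
        for urlA in imageAurls:
            responseA = requests.get(urlA)
            imgA = Image.open(BytesIO(responseA.content))
            hash0 = imagehash.average_hash(imgA)
            for urlB in imageBurls:
                responseB = requests.get(urlB)
                imgB = Image.open(BytesIO(responseB.content))
                hash1 = imagehash.average_hash(imgB)
                # Subtracting two hashes gives their Hamming distance
                if hash0 - hash1 < cutoff:
                    return 'similar'
        # Only reached when no pair of images was close enough
        return 'not similar'

    # Pair each row with the images of the following row
    df['NextImage'] = df['IMAGES'].shift(-1)
    df['IsSimilar'] = df.apply(lambda x: image_similarity(x['IMAGES'], x['NextImage']), axis=1)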
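
The nested loops above download every image again for each comparison, which gets slow on a large dataset. One possible workaround, sketched below under assumptions, is to hash each URL once, keep the result in a dictionary, and then compare the stored hashes. The hashes are treated as plain integers here (for example obtained with `int(str(imagehash.average_hash(img)), 16)`), and `hash_one` is a hypothetical callable standing in for the download-and-hash step.

```python
def hamming_distance(hash_a, hash_b):
    """Number of differing bits between two integer hashes."""
    return bin(hash_a ^ hash_b).count("1")


def hash_urls(urls, hash_one, cache):
    """Hash each URL at most once, reusing results stored in `cache` (a dict).

    hash_one: hypothetical callable that downloads a URL and returns an int hash.
    """
    hashes = []
    for url in urls:
        if url not in cache:
            cache[url] = hash_one(url)
        hashes.append(cache[url])
    return hashes


def share_similar_image(hashes_a, hashes_b, cutoff=5):
    """True if any pair of hashes differs by fewer than `cutoff` bits."""
    return any(
        hamming_distance(a, b) < cutoff for a in hashes_a for b in hashes_b
    )
```

With a shared `cache`, comparing a listing against many others costs one download per distinct image rather than one per comparison.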

Posted
5 years ago