Hello all! Rather than pollute the site with a bunch of details about what's going on, I'll try to keep a place here where I can post updates and you all can ask questions, give feedback, etc.
Current items of note:
Media
It became clear pretty early in the development of this site that dealing with the media storage was going to be a bit of a unique challenge due to the nature of re-posted/stolen content. For example, there are currently more than 100 images that have each been posted over 500 times in their exact same format... and many of them are over 1MB in size - that's roughly 50GB just for the same 100 images that _could_ be stored as 100MB if we didn't store the duplicates.
Sadly, the "off the shelf" solution that we were using to store/manage the media files (and handle all of the thumbnails and previews that need to be generated) doesn't have any duplicate file support or handling, so we had to build our own system to handle it... which isn't all that difficult, just time consuming to build and then migrate all the data over.
The good news here is that a) it's almost done, and b) we did a deep dive into "similarity" within a group of images and have a pretty solid solution that not only handles exact duplicate images but is also very good at identifying similar images - this hasn't been put into action on the website yet but it should have some very interesting applications (identifying stolen images, finding related/alt accounts, etc).
This system should help keep costs down and also prove to be a very valuable feature for the community to help deal with spam/stolen content.
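A minimal sketch of the exact-duplicate idea, assuming files are identified by a SHA-256 hash of their bytes. This is not the site's actual implementation; the `DedupStore` class and its field names are hypothetical, and it only covers byte-identical files, not the "similar image" detection mentioned above.

```python
import hashlib


class DedupStore:
    """Toy content-addressed store: identical bytes are kept only once."""

    def __init__(self):
        self.blobs = {}       # content hash -> file bytes (stored once)
        self.post_media = {}  # post id -> content hash (cheap reference)

    def add(self, post_id, data):
        # The hash depends only on the file contents, so a re-post of the
        # same file maps to the blob that is already stored.
        digest = hashlib.sha256(data).hexdigest()
        self.blobs.setdefault(digest, data)
        self.post_media[post_id] = digest
        return digest

    def stored_bytes(self):
        return sum(len(blob) for blob in self.blobs.values())


store = DedupStore()
image = b"\x89PNG" + b"\x00" * 1000  # stand-in for a real image file
for n in range(500):
    store.add(f"post-{n}", image)

print(len(store.post_media), "posts reference", len(store.blobs), "stored file")
```

Each post keeps only a short hash as a reference, so 500 posts of the same image cost one copy of the file plus 500 small references.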
Search
So, "search" in a technical sense is hard to do well. It's easy to implement search on small datasets, or to implement it in a way that's "functional" but just not very good. Doing it well with a large dataset generally requires making concessions or building out some real infrastructure to handle it. We've had experience on other projects with handling search on large-ish datasets, so we came into this with an idea of what we were going to need to do - but it's still proven to be problematic in some unique ways. We're currently using a very "fresh" (i.e. barely out of beta) search technology (https://www.meilisearch.com/) that is very good for small-ish datasets, but it turns out we're pushing it into territory it wasn't really ready for yet, given the size of our data and the nature of our use. Thankfully the developers have been amazing - going as far as getting into our search server, monitoring how we're using it, and making very effective changes to their platform to make it work for us (and others who are running it in a similar way).
For the time being (until the next update here in a couple weeks) it looks like the search is going to remain a bit "behind" - we're just sending too many documents (posts) to it too often for it to keep up. Example: A post comes into enmlounge.com from Reddit and we ingest it into our database - at that point it's immediately available on the website. At the time it's ingested it's also sent over to the search engine to be 'indexed' into the search database - but it's taking anywhere from an hour to 12 hours for that data to be 'live'. This is because while we're constantly ingesting new data (posts, subreddits, authors, anything that can be searched) we're also making updates to the existing search database - like removing deleted posts, spam flagged posts, banned users, etc. - and all those tasks are just bogging things down.
As things with the website progress (hopefully) in terms of financial support we'll be able to throw some more money at the problem to speed things up... but also, Meilisearch is working on some updates now that should greatly improve things.
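One general way to reduce the number of writes hitting a search index is to queue changes and collapse them per document before sending a batch. The sketch below illustrates that technique only; it is an assumption, not a description of how the site or Meilisearch handles this, and `IndexQueue` and its methods are hypothetical names. It deliberately stops short of calling any search client.

```python
class IndexQueue:
    """Collects pending search-index changes, keeping only the latest per document."""

    def __init__(self):
        # document id -> ("upsert", document) or ("delete", None)
        self.pending = {}

    def upsert(self, doc_id, document):
        self.pending[doc_id] = ("upsert", document)

    def delete(self, doc_id):
        # A delete replaces any queued upsert for the same document.
        self.pending[doc_id] = ("delete", None)

    def flush(self):
        """Return one batch of upserts and one batch of deletes, then reset."""
        upserts = [doc for op, doc in self.pending.values() if op == "upsert"]
        deletes = [doc_id for doc_id, (op, _) in self.pending.items() if op == "delete"]
        self.pending = {}
        return upserts, deletes


queue = IndexQueue()
queue.upsert("p1", {"id": "p1", "title": "first"})
queue.upsert("p1", {"id": "p1", "title": "first (edited)"})
queue.upsert("p2", {"id": "p2", "title": "spam"})
queue.delete("p2")  # flagged as spam before it was ever indexed

upserts, deletes = queue.flush()
print(upserts, deletes)
```

Four queued changes turn into one document to index and one id to delete, which would then be sent to the search engine as two requests instead of four.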
Spam
Here are some general numbers to give you all an idea of what kind of spam problem reddit has...
- Out of the total number of reddit users that have been 'discovered' by our crawlers - a full 1/3 of them are either disabled (by reddit), suspended (by reddit) or spam banned (by us).
- Out of the total number of reddit posts that have been 'discovered' by our crawlers - again, just over 1/3 of the total posts are either deleted (by reddit admin, moderators or the author) or classified as spam (by us). And the ratio of spam to deleted is 5:1.
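To put the post numbers in terms of all posts, here is a quick worked calculation. It assumes the flagged share is exactly 1/3 (the real figure is "just over" that) and that the 5:1 ratio means 5 spam posts for every 1 deleted post.

```python
from fractions import Fraction

flagged = Fraction(1, 3)            # assumed: exactly 1/3 of posts are deleted or spam
spam = flagged * Fraction(5, 6)     # 5 parts spam out of every 6 flagged posts
deleted = flagged * Fraction(1, 6)  # 1 part deleted out of every 6 flagged posts

print(f"spam: {spam} (~{float(spam):.1%}), deleted: {deleted} (~{float(deleted):.1%})")
# spam: 5/18 (~27.8%), deleted: 1/18 (~5.6%)
```

Under those assumptions, a bit more than one in four discovered posts is classified as spam.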
At the moment we've got the foundation for a pretty nifty spam detection system in place, but it's only being applied to 1 out of 3 "data points" that it can be tied into.
Right now it's only looking at post titles, and it approaches them in two separate ways:
1) When each post is entered into the database the title is analyzed and given a unique "value" - this "value" is unique to the content of the title, not the title itself. Meaning if another post comes in with the exact same title it will get that same identifier. This helps us easily identify exact duplicates - but that's always been easy. There's also some magic implemented that will identify "similar" titles based on a sliding scale of similarity.
2) Raw "banned" terms - There's a system in place that will check all incoming (and existing) post titles for a list of "banned" terms and if they're found the post is flagged. This system also supports "pattern matching" for these banned terms, so it can get very complex and powerful - or remain very simple.
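A rough sketch of the first approach, under stated assumptions: the "value" is taken to be a hash of the title after normalizing case, punctuation and whitespace, and the "sliding scale" is taken to be a 0.0 to 1.0 score from Python's `difflib`. The post does not say how either is actually computed, so the functions `normalize`, `fingerprint` and `similarity` are hypothetical stand-ins.

```python
import hashlib
import re
from difflib import SequenceMatcher


def normalize(title):
    """Lowercase, drop punctuation, and collapse whitespace."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return " ".join(words)


def fingerprint(title):
    """Titles with the same normalized content get the same identifier."""
    return hashlib.sha1(normalize(title).encode("utf-8")).hexdigest()


def similarity(a, b):
    """Score from 0.0 (nothing in common) to 1.0 (identical after normalizing)."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


a = "Check out my NEW profile!!!"
b = "check out my new profile"
c = "check out my new profile 2"

print(fingerprint(a) == fingerprint(b))  # exact duplicate after normalizing
print(round(similarity(b, c), 2))        # near-duplicate scores close to 1.0
```

A threshold on the similarity score (say, flag anything above some cutoff) is what would make the scale "sliding"; where to put that cutoff is a tuning decision this sketch doesn't try to answer.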
At the moment both of these are only applied to post titles - which the spammers have clearly identified as something they need to vary in order to bypass existing filters/auto-mod rules (and which the similarity system mentioned above does a splendid job of identifying).
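The banned-terms check with pattern matching could look something like the sketch below. The terms and the pattern are made-up examples, and treating plain terms as whole-word, case-insensitive matches is an assumption; the post doesn't describe the real rule format.

```python
import re

# Hypothetical rules: plain terms are matched literally, patterns are raw regexes.
BANNED_TERMS = ["free followers", "click my link"]
BANNED_PATTERNS = [r"\btelegram\s*[:@]\s*\w+"]

RULES = [re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE) for term in BANNED_TERMS]
RULES += [re.compile(pattern, re.IGNORECASE) for pattern in BANNED_PATTERNS]


def matched_rules(title):
    """Return the regex source of every rule that matches the title."""
    return [rule.pattern for rule in RULES if rule.search(title)]


def is_flagged(title):
    return bool(matched_rules(title))


print(is_flagged("Get FREE followers today"))      # plain term
print(is_flagged("Add me on Telegram: someuser"))  # pattern
print(is_flagged("Weekend plans in Miami"))        # clean title
```

Keeping plain terms and patterns in separate lists is what lets a rule set stay very simple or get as complex as needed.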
We've been working to fine-tune these systems and make sure that all the kinks are ironed out before also implementing them for:
1) Post body - self post content, there appear to be many many many self posts created with the exact same content that could easily be flagged as spam.
2) Account descriptions - Lots of the spammers are using the exact same, or very similar account descriptions - again, should be easily identified and flagged as spam.
As things are right now the spam on enmlounge.com compared to reddit for the subreddits we're watching is night and day but it can still get much better.
(I also hope to implement a system to allow subreddit mods to utilize this spam-detection system combined with auto-mod to help clean things up - any lifestyle-related sub mods that are interested please feel free to reach out)
I hope this post shines a bit of light on what we're working on - there's tons of cool stuff we've got in the works but we're still really just building out the foundation for a lot of these more advanced features.
Some of the stuff we've got in the works...
1) Mobile layout - currently the website is not built with mobile users in mind, mainly due to the nature of how fast things are changing and there being no real "design" in place - we're just patching things together as we get them working with a complete disregard to how it looks in most cases. This has resulted in a website that is basically not usable on mobile devices - we're working on that!
2) Ability to "follow" Locations (you can already follow users and subreddits) and specific "phrases" so that posts that are tagged with that location or include that phrase show up in "your feed".
3) Email/SMS alerts - Do you want to know any time a new post is created in a subreddit you follow? We can do that. How about any time a new post is found that has a specific word or phrase that you configure? (i.e. any post that has the word "hotwife" as well as "F4M" and "miami" and was posted into /r/HotWifeRequests) Or maybe you just want a daily update email that gives you highlights from the various users/subreddits/locations/terms that you've followed?
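As a sketch of how an alert like the one in item 3 might be evaluated, assuming an alert is "all of these terms in the title, optionally limited to one subreddit". The feature isn't built yet, so the `AlertRule` class, its fields, and the post dictionary shape are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    required_terms: tuple  # every term must appear in the title (case-insensitive)
    subreddit: str = ""    # empty string means "any subreddit"

    def matches(self, post):
        if self.subreddit and post["subreddit"].lower() != self.subreddit.lower():
            return False
        title = post["title"].lower()
        return all(term.lower() in title for term in self.required_terms)


rule = AlertRule(required_terms=("hotwife", "F4M", "miami"), subreddit="HotWifeRequests")

post = {"title": "[F4M] Hotwife visiting Miami this weekend", "subreddit": "HotWifeRequests"}
other = {"title": "[F4M] Hotwife visiting Miami this weekend", "subreddit": "SomewhereElse"}

print(rule.matches(post))   # all terms present, right subreddit
print(rule.matches(other))  # same title, different subreddit
```

A daily digest could reuse the same rules: instead of sending an alert on each match, collect the matching posts and send them together once a day.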
As always - we're open to feedback, ideas, requests, etc! Feel free to submit those things here in the comments, or DM them to this account or you can always submit a contact form on the website (https://enmlounge.com/contact)