

Using fastText to create embeddings for news articles in Python

Hi! First off: I'm completely new to Python & ML in general, and everything I've done so far is basically copying and pasting code until it worked.

I want to create a simple tool that analyzes news articles I gathered and puts them on a 2D plot based on their content and how they are written. What I've done so far: fetching the articles, cleaning them up, feeding them into a pre-trained fasttext model, and then visualizing the result using t-SNE. It works quite well and I'm getting clusters based on what the articles are talking about, great so far. Now I want to improve on that a little bit (with my little understanding of...well, everything) and ran into a few options.
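For reference, the embedding step described above could look roughly like the sketch below. It is not my exact code: `clean_text`, `embed_articles` and `FakeModel` are names made up for illustration, and `FakeModel` only stands in for a real model so the snippet runs without a model file. With the `fasttext` package you would instead pass in the object returned by `fasttext.load_model(...)` (for example the pre-trained `cc.en.300.bin` file, assuming that is the model in use), whose `get_sentence_vector` method expects a single line of text without newlines.

```python
import re
import numpy as np

def clean_text(text):
    """Collapse all whitespace, including newlines, into single spaces."""
    return re.sub(r"\s+", " ", text).strip()

def embed_articles(model, articles):
    """Return an (n_articles, dim) matrix with one sentence vector per article."""
    return np.vstack([model.get_sentence_vector(clean_text(a)) for a in articles])

class FakeModel:
    """Hypothetical stand-in for a loaded fastText model (NOT real embeddings)."""
    def __init__(self, dim=300):
        self.dim = dim

    def get_sentence_vector(self, text):
        # Deterministic pseudo-random vector so the sketch is reproducible
        rng = np.random.default_rng(len(text))
        return rng.normal(size=self.dim).astype(np.float32)

articles = ["First article.\nSecond line.", "Another   article about the same topic."]
matrix = embed_articles(FakeModel(), articles)
print(matrix.shape)  # (2, 300)
```

The resulting matrix is what gets handed to the 2D projection step, e.g. scikit-learn's `TSNE(n_components=2).fit_transform(matrix)` once there are enough articles (the perplexity has to be smaller than the number of samples).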

Firstly, I read that fasttext (or basically anything that reduces a sentence down to tokens) can run into problems with the semantics of a sentence. Meaning that the order in which words appear in a sentence, which can influence the meaning, is ignored, and as long as sentences contain the same words, the resulting vectors will be the same. Then I ran into BERT, which seems to work with whole sentences. My question now is: in the context of news (I have about 400 articles about the same topic, from different outlets, covering a range of one week), would there be any noticeable difference in the output? Or am I understanding this wrong from the start?
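To illustrate what I mean by word order being ignored, here is a tiny sketch. It assumes the sentence vector is just the average of the word vectors, which is my understanding of how fastText builds sentence vectors. The word vectors here are toy values derived from a hash, not real fastText output, and `toy_word_vector` and `average_embedding` are made-up helper names.

```python
import hashlib

def toy_word_vector(word, dim=8):
    """Deterministic stand-in for a pre-trained word vector (NOT real fastText)."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    return [b / 255.0 - 0.5 for b in digest[:dim]]

def average_embedding(sentence, dim=8):
    """Average the word vectors of a sentence; word order plays no role."""
    vectors = [toy_word_vector(w, dim) for w in sentence.lower().split()]
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

a = average_embedding("the dog bit the man")
b = average_embedding("the man bit the dog")

# Same words, different order: the averages agree up to float rounding
same = all(abs(x - y) < 1e-12 for x, y in zip(a, b))
print(same)  # True
```

Both sentences end up at the same point even though they mean different things, which is the limitation a context-aware model like BERT is supposed to address.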

Secondly: While some sources say that t-SNE is the way to go, others talk about PCA or UMAP. Is there an inherent benefit to using any of them or does it boil down to "whatever works for me"?
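As far as I understand it, the main difference is that PCA is a linear, deterministic projection, while t-SNE and UMAP are nonlinear, depend on a random seed, and focus on keeping nearby points close together. A minimal PCA sketch using only NumPy is below; the random matrix is just a placeholder for the 400 article embeddings, and `pca_2d` is a made-up helper name rather than a library function.

```python
import numpy as np

def pca_2d(vectors):
    """Project rows onto their first two principal components via SVD."""
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are the principal directions, sorted by explained variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

rng = np.random.default_rng(0)
vectors = rng.normal(size=(400, 300))  # placeholder for real article embeddings
coords = pca_2d(vectors)
print(coords.shape)  # (400, 2)
```

Running this twice on the same input gives the same coordinates, whereas a t-SNE or UMAP plot can change between runs unless the seed is fixed.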

So, basically: I already like what I get from fasttext + t-SNE + a simple scatter plot. Is there a way to drastically improve what I get, or would anything else be overkill for my use case? Thanks!


Posted
4 years ago