Hi! First off: I'm completely new to Python & ML in general, and everything I've done so far is basically copying and pasting code until it worked.
I want to create a simple tool that analyzes news articles I've gathered and puts them on a 2D plot based on their content and how they're written. What I've done so far: fetching the articles, cleaning them up, feeding them into a pre-trained fastText model, and then visualizing the result with t-SNE. It works quite well and I'm getting clusters based on what the articles talk about, which is great so far. Now I want to improve on that a little (with my limited understanding of... well, everything) and ran into a few options.
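In case it helps to see what I mean, here's roughly what my pipeline boils down to (a simplified sketch, not my actual script: the vector file name and the tiny article list are placeholders, and I happen to load the vectors with gensim):

```python
import numpy as np
from gensim.models import KeyedVectors
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Pre-trained fastText vectors in word2vec text format (e.g. cc.en.300.vec).
vectors = KeyedVectors.load_word2vec_format("cc.en.300.vec", binary=False)

def article_vector(text):
    # Average the word vectors of all in-vocabulary tokens -- this is
    # exactly the step where word order gets thrown away.
    tokens = [t for t in text.lower().split() if t in vectors]
    return np.mean([vectors[t] for t in tokens], axis=0)

articles = [
    "cleaned text of the first article",
    "cleaned text of the second article",
    # ... about 400 of these in my actual run
]
X = np.vstack([article_vector(a) for a in articles])

# Project the 300-dim article vectors down to 2D.
# t-SNE needs perplexity < number of samples, hence the min().
coords = TSNE(n_components=2,
              perplexity=min(30, len(articles) - 1),
              random_state=0).fit_transform(X)

plt.scatter(coords[:, 0], coords[:, 1])
plt.show()
```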
Firstly, I read that fastText (or basically anything that reduces a sentence down to individual tokens) can run into problems with sentence semantics: the order in which words appear, which can change the meaning, is ignored, so as long as two sentences contain the same words, the resulting vectors will be the same. Then I ran into BERT, which seems to work on whole sentences. My question now is: in the context of news (I have about 400 articles on the same topic, spanning one week, from different outlets), would there be any noticeable difference in the output? Or am I misunderstanding this from the start?
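From what I pieced together from the docs, swapping in sentence embeddings would look something like this (untested; the model name is just a common default I saw mentioned, not something I've benchmarked):

```python
from sentence_transformers import SentenceTransformer

articles = [
    "cleaned text of the first article",
    "cleaned text of the second article",
]

# The transformer attends to word order instead of averaging it away,
# and returns one embedding per article directly.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(articles)  # numpy array, shape (n_articles, 384)
```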
Secondly: while some sources say that t-SNE is the way to go, others talk about PCA or UMAP. Is there an inherent benefit to using any of them, or does it boil down to "whatever works for me"?
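As far as I can tell, the three are drop-in replacements for each other at the code level; here's a sketch with random data standing in for my real article vectors (umap-learn is a separate install):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import umap  # pip install umap-learn

X = np.random.rand(400, 300)  # stand-in for my ~400 fastText article vectors

coords_pca  = PCA(n_components=2).fit_transform(X)        # linear, fast, deterministic
coords_tsne = TSNE(n_components=2).fit_transform(X)       # emphasizes local clusters, stochastic
coords_umap = umap.UMAP(n_components=2).fit_transform(X)  # local clusters plus some global layout
```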
So, basically: I already like what I get from fastText + t-SNE as a simple scatter plot. Is there a way to drastically improve on that, or would anything else be overkill for my use-case? Thanks!