Hi! First off: I'm completely new to Python & ML in general, and everything I've done so far is basically copying and pasting code until it worked.
I want to create a simple tool that analyzes news articles I've gathered and puts them on a 2D plot based on their content and how they're written. What I've done so far: fetching the articles, cleaning them up, feeding them into a pre-trained fastText model, and then visualizing the result with t-SNE. It works quite well and I'm getting clusters based on what the articles talk about, which is great so far. Now I want to improve on that a little (with my limited understanding of... well, everything) and ran into a few options.
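In case it helps to see what I mean, here's roughly what my pipeline boils down to (a simplified sketch, not my actual script: the vector file name and the tiny article list are placeholders, and I happen to load the vectors with gensim):

```python
import numpy as np
from gensim.models import KeyedVectors
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Pre-trained fastText vectors in word2vec text format (e.g. cc.en.300.vec).
vectors = KeyedVectors.load_word2vec_format("cc.en.300.vec", binary=False)

def article_vector(text):
    # Average the word vectors of all in-vocabulary tokens -- this is
    # exactly the step where word order gets thrown away.
    tokens = [t for t in text.lower().split() if t in vectors]
    return np.mean([vectors[t] for t in tokens], axis=0)

articles = [
    "cleaned text of the first article",
    "cleaned text of the second article",
    # ... about 400 of these in my actual run
]
X = np.vstack([article_vector(a) for a in articles])

# Project the 300-dim article vectors down to 2D.
# t-SNE needs perplexity < number of samples, hence the min().
coords = TSNE(n_components=2,
              perplexity=min(30, len(articles) - 1),
              random_state=0).fit_transform(X)

plt.scatter(coords[:, 0], coords[:, 1])
plt.show()
```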
Firstly, I read that fastText (or basically anything that reduces a sentence down to individual tokens) can run into problems with sentence semantics: the order in which words appear, which can change the meaning, is ignored, so as long as two sentences contain the same words, the resulting vectors will be the same. Then I ran into BERT, which seems to work on whole sentences. My question now is: in the context of news (I have about 400 articles on the same topic, spanning one week, from different outlets), would there be any noticeable difference in the output? Or am I misunderstanding this from the start?
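From what I pieced together from the docs, swapping in sentence embeddings would look something like this (untested; the model name is just a common default I saw mentioned, not something I've benchmarked):

```python
from sentence_transformers import SentenceTransformer

articles = [
    "cleaned text of the first article",
    "cleaned text of the second article",
]

# The transformer attends to word order instead of averaging it away,
# and returns one embedding per article directly.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(articles)  # numpy array, shape (n_articles, 384)
```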
Secondly: while some sources say that t-SNE is the way to go, others talk about PCA or UMAP. Is there an inherent benefit to using any of them, or does it boil down to "whatever works for me"?
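As far as I can tell, the three are drop-in replacements for each other at the code level; here's a sketch with random data standing in for my real article vectors (umap-learn is a separate install):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import umap  # pip install umap-learn

X = np.random.rand(400, 300)  # stand-in for my ~400 fastText article vectors

coords_pca  = PCA(n_components=2).fit_transform(X)        # linear, fast, deterministic
coords_tsne = TSNE(n_components=2).fit_transform(X)       # emphasizes local clusters, stochastic
coords_umap = umap.UMAP(n_components=2).fit_transform(X)  # local clusters plus some global layout
```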
So, basically: I already like what I get from fastText + t-SNE as a simple scatter plot. Is there a way to drastically improve on that, or would anything else be overkill for my use-case? Thanks!