This post has been de-listed
It is no longer included in search results and normal feeds (front page, hot posts, subreddit posts, etc). It remains visible only via the author's post history.
My mind is a bit fried from doing a lot of coding over the past 24 hours. :) The real-time comment table has a key called "is_submitter" that is true if the comment OP is also the submission OP.
What I want to do is find the percentage of total link_ids where the submitter also made at least one comment. This can be done using just the comment table but my SQL is a bit rusty at the moment.
I would imagine it would start with grouping comments by link_id (submission) and finding the percentage between link_ids where all is_submitter values are false and link_ids where at least one is_submitter value is true.
My starting attempt is something like:
SELECT is_submitter, link_id FROM `pushshift.rt_reddit.comments` GROUP by link_id
The issue there is that is_submitter needs some type of aggregation function applied to it. I might be going down the wrong path here but maybe someone with more SQL kung-fu can hack out a SQL statement to find the answer.
Thanks!
Edit: I'm getting close.
SELECT link_id FROM pushshift.rt_reddit.comments
WHERE is_submitter = true GROUP BY link_id;
That gives the total number of submissions where the submitter commented at least once.
SELECT link_id FROM pushshift.rt_reddit.comments
GROUP BY link_id;
The second one would give the total number of submissions. So I just need to take the ratio of the two. For some reason, I keep thinking there is a more elegant way to do this with one SQL statement.
Oh Oh! This query is super interesting:
SELECT sum(cast(is_submitter as INT64)),link_id FROM `pushshift.rt_reddit.comments` WHERE is_submitter = true GROUP BY link_id ORDER BY 1 DESC;
This shows every link_id and the total number of times the submitter commented in that submission. This is super interesting and probably a good basis to answer my original question. But when I ran it, I saw numbers in the hundreds for the submitter. Now I need to figure out what the hell is going on for the submitter to post over a hundred times in their own submission.
And this is why BigQuery is AWESOME with big data. :)
I also just realized that when I converted the reddit id from base36 to base10, I lost the ability to easily check the submission (it's still possible).
I'm not sure if BigQuery has a base10 to base36 function, but the submission table does have the name field with the base36 value. I might go ahead and add the name field for comments ... hmm. But there should be a function to do base10 -> base36 conversions for BigQuery.
Edit 2 Actually, all comments have a permalink value which is tied to the link_id, so we can do this:
SELECT sum(cast(is_submitter as INT64)),link_id,ANY_VALUE(permalink) FROM `pushshift.rt_reddit.comments` WHERE is_submitter = true GROUP BY link_id ORDER BY 1 DESC;
And there it is! It was a bot that went crazy:
https://www.reddit.com/r/Testbot114/comments/8ljacv/test/
This is really fun!
Subreddit
Post Details
- Posted
- 6 years ago
- Reddit URL
- View post on reddit.com
- External URL
- reddit.com/r/bigquery/co...