Reddit September 2018 Comments are now available for download

Stats:

| Key | Value |
|---|---|
| Filename | RC_2018-09.xz |
| Location | https://files.pushshift.io/reddit/comments/RC_2018-09.xz |
| Start Time | 2018-09-01 00:00:00 UTC |
| End Time | 2018-09-30 23:59:59 UTC |
| Compressed Size | 10,715,442,268 bytes (~11 GB) |
| Uncompressed Size | 117,964,567,469 bytes (~118 GB) |
| Compression Type | .xz (LZMA/LZMA2) |
| Subreddit Cardinality | 109,651 |
| Author Cardinality | 5,052,316 |
| Largest Score | 65,693 |
| Lowest Score | -59,834 |
| Number of Objects | 104,473,929 comments |
| SHA256 Checksum | 5324affffdc7f39d2bd4e109adffbd3e2b245d9f57cc67759d7e109ea2d9ebb4 |
| File Format | ndjson (newline \n delimited JSON objects) |
| File Encoding | UTF-8 (Unicode encoded / 7-bit ASCII safe) |
| Data Visual | Hourly View of Data |
| Top Subreddits | 50 Most Active Subreddits |
| Top Authors | 50 Most Active Authors |
| Time View | Top 5 Subreddits Time Aggregation |
| Term View | Top 25 Subreddits with Comments Mentioning Trump |
| Admin Activity | Top 15 Subreddits with the Most Admin Comments |
| Verbose Comments | Top 25 Subreddits with Comments Greater Than 5,000 Characters in Length |
| Huge Trees | Top 5 Subreddits with Comment Nest Levels Greater Than 500 |
| Fast Replies | Top 10 Subreddits with Comment Replies in Under 30 Seconds |
| Fast Replies (Authors) | Top 100 Authors with the Most Comment Replies in Under 30 Seconds |
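
Before decompressing, the SHA256 checksum from the table can be used to confirm that the ~11 GB download is intact. A minimal Python sketch, assuming RC_2018-09.xz is in the current working directory:

#!/usr/bin/env python3

# Sketch: verify the SHA256 checksum of the downloaded archive against the
# value listed in the table above.
import hashlib

EXPECTED = "5324affffdc7f39d2bd4e109adffbd3e2b245d9f57cc67759d7e109ea2d9ebb4"

sha256 = hashlib.sha256()
with open("RC_2018-09.xz", "rb") as f:
    # Hash in 1 MiB chunks so the ~11 GB file never has to fit in memory
    for chunk in iter(lambda: f.read(1024 * 1024), b""):
        sha256.update(chunk)

print("OK" if sha256.hexdigest() == EXPECTED else "Checksum mismatch!")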

This file contains Reddit comments for September 2018. There are four quarantined subreddits included in this dump: ice_poseidon, cringeanarchy, theredpill and braincels. I decided to include them in the standard dump since they have been part of previous dumps for a long time and they are four of the largest subreddits Reddit has quarantined to date.
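
If you would rather leave those four quarantined subreddits out of your own analysis, a minimal sketch along these lines drops them while streaming decompressed ndjson from stdin (the same pattern as read_data.py below; the comparison is lowercased since subreddit names appear with mixed case in the data):

#!/usr/bin/env python3

# Sketch: skip comments from the four quarantined subreddits while streaming
# ndjson comment objects from stdin (e.g. via xz -cd RC_2018-09.xz | ...).
import json
import sys

QUARANTINED = {"ice_poseidon", "cringeanarchy", "theredpill", "braincels"}

for line in sys.stdin:
    obj = json.loads(line)
    # Lowercase the subreddit name so mixed-case names still match the set above
    if obj["subreddit"].lower() in QUARANTINED:
        continue
    print(obj["subreddit"], obj["author"], obj["score"], sep=",")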

Python example of reading data (read_data.py):

#!/usr/bin/env python3

# read_data.py -- read ndjson comment objects from stdin and print a CSV-style
# line of subreddit, author and score for each comment.
import sys

try:
    import ujson as json  # faster third-party parser (pip install ujson)
except ImportError:
    import json  # fall back to the standard library

for line in sys.stdin:
    # Each line is one JSON object representing a single comment
    obj = json.loads(line)
    print(obj['subreddit'], obj['author'], obj['score'], sep=',')

Linux command line to process the first 1,000 comments (make the script executable first with chmod +x read_data.py):
xz -cd RC_2018-09.xz | head -n 1000 | ./read_data.py
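
If you prefer to stay entirely in Python, the standard-library lzma module can decompress the archive on the fly instead of piping through xz. A minimal sketch, assuming RC_2018-09.xz sits in the working directory (streaming all ~118 GB of decompressed data will take some time):

#!/usr/bin/env python3

# Sketch: stream the archive directly from Python using the standard-library
# lzma module, with no shell pipeline required.
import json
import lzma

with lzma.open("RC_2018-09.xz", "rt", encoding="utf-8") as f:
    for line in f:
        obj = json.loads(line)
        print(obj["subreddit"], obj["author"], obj["score"], sep=",")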
