Coming soon - Get a detailed view of why an account is flagged as spam!
view details

This post has been de-listed

It is no longer included in search results and normal feeds (front page, hot posts, subreddit posts, etc). It remains visible only via the author's post history.

0
How to go about implementing a hash index for my storage?
Post Body

Imagine i have to implement a time series data store where an entry looks like this:

{id - 64 bit auto incrementing long, time - 64 bit long, value - 64-512 bit binary, crc - 32 bit, version - 64 bit}

Primary key is {time, id}

The size of above entry would be between 36B - 92B.My table size would be at max 10GB.One host can be having 100s of table as this is a multi tenant system.

So I will have ~ 10GB/36B ~ 300M entries.

Now I have following req:

  1. Optimize for ingestion esp on tip(current time) which moves forwar
  2. Do deduplication based on {id time version} to reject lower versions synchronously. Again time here mostly would be tip
  3. Have support for fast snapshot of storage for backups
  4. Support deletion based on predicate which would be like:

Note that duplicates would be rare and hence I believe I would benefit from keeping an index(id time) in memory and not entire data records.

I am evaluating following:

  1. Hash/Range based index - I am thinking of a bitcask like storage where i can keep index in memory. Since an index entry would take {16byte for key 8byte for offset} = 24B, I would need 24B * 300 M ~ 7GB memory for index alone for 1 table which is a lot.Hence I am thinking of a slightly different design though where I will divide my store into N partitions internally on time(say 10) and keep only the bucket(s) which are actively ingesting in memory. Since my most common case is tip ingestion, it will be 1 bucket that would be memory and so my index size goes down by factor of 10. This however adds some complexity in design. Also I believe implementing 4 is tricky if no time predicate is in query and I have to open all buckets. I guess the one way to get around this is to track tombstones separately.
  2. LSM based engine - This should be obvious, however it does make sizing the memtable a bit tricky. Since the memtable now stores the whole entry, it means I can have less values in memory.
  3. BTree based engine - Thinking of something like Sqlite with primary key as {time id} (and not {id time}). However I don;t think it would shine on writes. This howevers offers ability to run complex queries(if needed in future).

Anyone wants to guide me here?

Edit: Title wrongly says "hash", ignore it

Author
Account Strength
50%
Account Age
2 years
Verified Email
Yes
Verified Flair
No
Total Karma
213
Link Karma
197
Comment Karma
16
Profile updated: 3 days ago
Posts updated: 3 months ago

Subreddit

Post Details

We try to extract some basic information from the post title. This is not always successful or accurate, please use your best judgement and compare these values to the post title and body for confirmation.
Posted
10 months ago