Help optimizing many to many join

This post has been de-listed

It is no longer included in search results and normal feeds (front page, hot posts, subreddit posts, etc). It remains visible only via the author's post history.

Post Body

So a bit of background this is a database representing surveys, surveytakers, and their demographics (Male, Female, location, etc)

The relationship between the takers and groups are in a through table, which i denormalized to include the survey for better join performance

SELECT count(*), taker.complete, group.name
from surveytaker taker
inner join surveytaker__group through
    on taker.id = through.surveytaker_id
    and taker.survey_id = through.survey_id
inner join group
    on through.group_id = group.id
    and group.company_id = taker.company_id
    and group.survey_id = taker.survey_id
    and through.group_id = group.id
where taker.survey_id = 22101
and taker.company_id = 2997
group by taker.complete, group.name

I've indexed the takers on (company_id, survey_id)

the groups on (company_id, survey_id)

and the through/join table on (survey_id, taker_id, group_id)

as you may be able to see from the group by, I'm trying to get the count of complete/incomplete surveytakers by the groups they belong to; the response rate by demographics. I know that by doing this, I'm joining (in the case of this survey) around 10k rows of takers, to around 15k rows of groups, to the join table of over 200k rows 💥. The explain (without the groupby as it add complexity to the explain with no cost) is below

Aggregate  (cost=22241.84..22241.85 rows=1 width=8)
  ->  Gather  (cost=1054.71..22241.83 rows=1 width=0)
    Workers Planned: 1
    ->  Nested Loop  (cost=54.71..21241.73 rows=1 width=0)
          ->  Nested Loop  (cost=54.28..18984.59 rows=286 width=16)
                ->  Parallel Bitmap Heap Scan on group   (cost=53.71..9241.82 rows=1451 width=12)
                      Recheck Cond: ((company_id = 2997) AND (survey_id = 22101))
                      ->  Bitmap Index Scan on group_company_id_survey_id_index  (cost=0.00..53.09 rows=2466 width=0)
                            Index Cond: ((company_id = 2997) AND (survey_id = 22101))
                ->  Index Only Scan using surveytaker__group_group_id_survey on surveytaker__group through  (cost=0.57..5.99 rows=72 width=12)
                      Index Cond: ((survey_id = 22101) AND (group_id = group.id))
          ->  Index Scan using surveytaker_pkey on survey_surveytaker taker  (cost=0.43..7.89 rows=1 width=12)
                Index Cond: (id = through.surveytaker_id)
                Filter: ((survey_id = 22101) AND (company_id = 2997))

The pain point here seems to be on these nested loops, postgres isn't doing an index only scan on the groups which is interesting... is there an index I'm missing? or is this simply the best that postgres can do on many joins and i need to shovel more money into hardware? Its a rather complex explain so I'm trying to wrap my head around it.

Author

Account Strength

90%

Account Age

7 years

Verified Email

Yes

Verified Flair

Total Karma

1,518

Link Karma

962

Comment Karma

556

Profile updated: 4 days ago

Posts updated: 1 day ago

Ophirril

Subreddit

r/SQL

Post Details

We try to extract some basic information from the post title. This is not always successful or accurate, please use your best judgement and compare these values to the post title and body for confirmation.

Posted: 5 years ago
Reddit URL: View post on reddit.com
External URL: reddit.com/r/SQL/comment...