Working with batch effects

Two years ago, I started working on a project that uses both RNAseq and ATACseq. It's supposed to be a simple Healthy Control (HC) vs disease study. The sample collection was done in 2 phases, and it was clear that there was a batch effect that we could adjust for.

However, recently, we received additional metadata. A PCA plot showed that there was another, larger batch effect that we didn't account for: location of sample donation. There are 4 different locations. The disease samples come from any of the 4, but ALL HC samples came from only one of them.
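A quick way to see how lopsided this is would be to cross-tabulate condition against location. The sketch below is only an illustration: the site names and sample counts are made up, and it assumes the one HC site also contributed some disease samples, which "any of the 4" suggests but does not guarantee.

```python
from collections import Counter

# Hypothetical sample sheet: one (location, condition) pair per sample.
# The counts are invented; only the pattern (HC from a single site)
# mirrors the situation described in the post.
samples = (
    [("site_A", "HC")] * 6
    + [("site_A", "disease")] * 3
    + [("site_B", "disease")] * 4
    + [("site_C", "disease")] * 5
    + [("site_D", "disease")] * 2
)

def crosstab(rows):
    """Count samples per (location, condition) pair."""
    return Counter(rows)

def mixed_locations(rows):
    """Locations containing both conditions, i.e. the only places where
    a within-location HC vs disease comparison is possible."""
    seen = {}
    for location, condition in rows:
        seen.setdefault(location, set()).add(condition)
    return sorted(loc for loc, conds in seen.items() if len(conds) > 1)

table = crosstab(samples)
for location in sorted({loc for loc, _ in samples}):
    print(location, "HC:", table[(location, "HC")],
          "disease:", table[(location, "disease")])
print("locations with both conditions:", mixed_locations(samples))
```

In this toy layout only one site holds both groups, so any HC vs disease comparison that is adjusted for location leans heavily on the samples from that site.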

I re-ran the count data through DESeq2 with this design formula: ~ phase + location + disease. It didn't fuss about collinearity like it often likes to do and then pooped out a big list of DEGs in the RNAseq.
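One likely reason there was no collinearity complaint: DESeq2 stops with an error when the model matrix is not full rank, and an additive design with phase, location, and disease terms stays full rank as long as at least one location contains both HC and disease samples. The sketch below checks this on a toy layout. The sample layout, the treatment coding, and the helper names `matrix_rank` and `design_row` are all assumptions made for illustration, not anything taken from DESeq2 itself.

```python
from fractions import Fraction

def matrix_rank(rows):
    """Rank via Gaussian elimination with exact fractions."""
    m = [[Fraction(x) for x in row] for row in rows]
    rank = 0
    n_cols = len(m[0]) if m else 0
    for col in range(n_cols):
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue  # no pivot: this column depends on earlier ones
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(len(m)):
            if r != rank and m[r][col] != 0:
                factor = m[r][col] / m[rank][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank

def design_row(phase, location, condition):
    """Treatment-coded row: intercept, phase2, site B, site C, site D, disease."""
    return [
        1,
        int(phase == 2),
        int(location == "B"),
        int(location == "C"),
        int(location == "D"),
        int(condition == "disease"),
    ]

# Hypothetical layout: HC only at site A, disease at all four sites.
mixed = [
    (1, "A", "HC"), (2, "A", "HC"),
    (1, "A", "disease"), (2, "A", "disease"),
    (1, "B", "disease"), (2, "B", "disease"),
    (1, "C", "disease"), (2, "C", "disease"),
    (1, "D", "disease"), (2, "D", "disease"),
]
# Same layout without the disease samples at site A: condition is now
# fully determined by location.
separated = [s for s in mixed if not (s[1] == "A" and s[2] == "disease")]

X_mixed = [design_row(*s) for s in mixed]
X_separated = [design_row(*s) for s in separated]

print("mixed:", matrix_rank(X_mixed), "of", len(X_mixed[0]), "columns")
print("separated:", matrix_rank(X_separated), "of", len(X_separated[0]), "columns")
```

Full rank only means the coefficients are identifiable. In a layout like this, the effects of the other three locations are estimated from disease samples alone, so the adjusted HC vs disease contrast rests on the assumption that location shifts HC and disease samples by the same amount.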

I could probably run the DEG through GSEA to see if the results match the disease's previously known signatures, but what statistical worries should I have about this design matrix? What justification would I need for this statistical asymmetry? Thanks.

Author

qwerty11111122

Msc | Academia

Subreddit

r/bioinformatics

Posted: 3 years ago