How are you guys validating your data?

I saw similar posts, but I also wanted to share the specific flow of our data pipeline. The data warehousing project started at my company some months ago, and we set up many ETL jobs in our pipeline, using Spark-Scala to write the logic and Airflow to schedule and run the processes that deliver data to business teams as Redshift tables they can query.

Currently everything is running smoothly and we haven't encountered any major problems in our pipeline. Now we're doing some research to set up a Data Quality project, since we're not testing our data yet.

I've personally looked into the Great Expectations library, and it seems promising: we could run it at the end of some Airflow pipelines to do basic validations, such as checking whether a column contains null values and sending us an alert. We don't need anything too fancy yet.
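
To make the kind of check I mean concrete, here is a minimal sketch in plain Python. It is a hand-rolled stand-in, not the Great Expectations API, and it assumes the rows of the final table have already been fetched as dictionaries. The names `find_null_columns`, `validate_and_alert`, and the `alert` callback are hypothetical; in a real setup the function would probably be called from a task at the end of the Airflow pipeline, and `alert` would post to Slack or email.

```python
from typing import Any, Callable, Dict, Iterable, List, Mapping


def find_null_columns(
    rows: Iterable[Mapping[str, Any]], required_columns: List[str]
) -> Dict[str, int]:
    """Count null values per required column.

    A missing key is treated the same as an explicit None.
    Only columns with at least one null are returned.
    """
    counts = {col: 0 for col in required_columns}
    for row in rows:
        for col in required_columns:
            if row.get(col) is None:
                counts[col] += 1
    return {col: n for col, n in counts.items() if n > 0}


def validate_and_alert(
    rows: Iterable[Mapping[str, Any]],
    required_columns: List[str],
    alert: Callable[[str], None],
) -> bool:
    """Run the null check and call `alert` with a message if it fails.

    Returns True when no nulls were found, False otherwise.
    """
    null_counts = find_null_columns(rows, required_columns)
    if null_counts:
        alert(f"Null check failed: {null_counts}")
        return False
    return True


if __name__ == "__main__":
    # Made-up sample rows standing in for a query result from the warehouse.
    sample_rows = [
        {"order_id": 1, "amount": 10.0},
        {"order_id": 2, "amount": None},
    ]
    # `print` stands in for a real alerting hook.
    validate_and_alert(sample_rows, ["order_id", "amount"], alert=print)
```

Passing the alert function in as a parameter keeps the check itself independent of whatever notification channel ends up being used.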

I wanted to know how you guys are doing data validation at your companies! Which tools are you using? Are you checking the resulting table at the end of your pipelines?

Posted
2 years ago