I saw similar posts, but I also wanted to share the specific flow in our data pipeline. The data warehousing project started at my company a few months ago, and we set up many ETL jobs in our pipeline, using Spark (Scala) to write the logic and Airflow to schedule and run the processes that deliver data to business teams as Redshift tables to be queried.
Currently everything is running smoothly and we haven't encountered any major problems in our pipeline, so now we're doing some research into setting up a Data Quality project, since we're not testing our data yet.
I've personally looked into the Great Expectations library, and it seems promising to implement at the end of some Airflow pipelines to do basic validations, such as checking whether a column value is null and sending us an alert. We don't need anything too fancy yet.
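To make the idea concrete, here is a minimal sketch of the kind of not-null check a final pipeline step could run, written in plain Python rather than the actual Great Expectations API (column names and the alert action are hypothetical placeholders):

```python
# Minimal post-pipeline null check, similar in spirit to Great Expectations'
# expect_column_values_to_not_be_null. Rows are represented as plain dicts;
# in practice they would come from the resulting Redshift table.

def check_not_null(rows, column):
    """Return (success, failing_row_indices) for a basic not-null validation."""
    failing = [i for i, row in enumerate(rows) if row.get(column) is None]
    return (len(failing) == 0, failing)

# Hypothetical sample data with one bad row.
rows = [
    {"customer_id": 1, "amount": 10.0},
    {"customer_id": None, "amount": 20.0},
]

success, failures = check_not_null(rows, "customer_id")
if not success:
    # In a real pipeline this would trigger an alert (email, Slack, etc.).
    print(f"ALERT: {len(failures)} null value(s) in customer_id at rows {failures}")
```

Great Expectations packages checks like this as reusable "expectations" with structured results, which makes it easy to wire the pass/fail outcome into an Airflow task at the end of a DAG.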
I wanted to know how you guys are doing Data Validation at your companies! Which tools are you using? Are you checking the resulting table at the end of your pipelines?
- Posted 2 years ago
- reddit.com/r/dataenginee...