I'm looking for a tool with functionality that I haven't been able to find (in my admittedly limited googling).
I'm developing a pipeline that is pretty linear and written in pure Python: nothing runs concurrently, and each step executes after the previous one. It has many steps, though, so creating and editing steps takes a lot of time, because I have to wait for all the previous steps to execute before the pipeline reaches the step I'm working on.
I'm looking for a way to save the state of the data after it completes a certain step, and then run only the step I'm working on against that saved data.
For example: I have a pipeline with 4 steps - Extract JSON1, Extract JSON2, Create a DataFrame from both JSONs, Store it in a Database. If I have already developed steps 1-3, I don't want to have to keep rerunning the whole script to develop step 4. I would want to automatically save the output from the previous steps, and just work on step 4 with the data already collected/modified.
I know that I could simply save the data in its own file and do it all manually, but I was wondering if a tool already existed where you could work with the data sequentially and essentially save the state of the data and just work with it that way. This would be a great time saver for me!
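To make the manual approach concrete, here is a rough sketch of what I have in mind, assuming each step is a plain function whose return value can be pickled. Everything in it is hypothetical and only for illustration: the `checkpoint` decorator, the `force` flag, the cache folder, and the placeholder step functions (lists of dicts stand in for my real JSON and DataFrame).

```python
import pickle
import tempfile
from functools import wraps
from pathlib import Path

# Hypothetical folder for saved step outputs (an assumption, not a convention).
CACHE_DIR = Path(tempfile.gettempdir()) / "pipeline_checkpoints"


def checkpoint(func):
    """Save a step's return value to disk and reuse it on later runs.

    The cache is keyed by function name only, so arguments are ignored
    whenever a saved result already exists.
    """
    @wraps(func)
    def wrapper(*args, force=False, **kwargs):
        CACHE_DIR.mkdir(parents=True, exist_ok=True)
        path = CACHE_DIR / f"{func.__name__}.pkl"
        if path.exists() and not force:
            # A saved result exists: load it instead of rerunning the step.
            with path.open("rb") as fh:
                return pickle.load(fh)
        result = func(*args, **kwargs)
        with path.open("wb") as fh:
            pickle.dump(result, fh)
        return result
    return wrapper


@checkpoint
def extract_json1():
    return {"id": 1, "value": "a"}  # placeholder for a slow extraction


@checkpoint
def extract_json2():
    return {"id": 2, "value": "b"}  # placeholder for a slow extraction


@checkpoint
def build_rows(first, second):
    return [first, second]  # stand-in for building the DataFrame


def store(rows):
    # Step 4, the one still under development, so it is not cached.
    return len(rows)


if __name__ == "__main__":
    rows = build_rows(extract_json1(), extract_json2())
    print(store(rows))
```

After the first run, steps 1-3 are read back from disk, so only `store` actually executes while I iterate on it. Passing `force=True` to a step reruns it and overwrites its saved output. The obvious weakness is that nothing invalidates the cache when a step's code or inputs change, which is part of why I'd prefer an existing tool.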
Any help is appreciated, thanks!