One part of my app acts as a pseudo data-pipeline, where I ETL a single player's data at a time (I have to, due to the nature of the API I'm pulling data from).
I am currently using pandas for all of this: I pull the data from the API, throw it into a DataFrame, transform it, and call `DataFrame.to_sql()` to store it.
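Roughly, the current flow looks like this. This is just a minimal sketch for context: the API endpoint, table name, and connection string are placeholders, not the real ones from my app.

```python
import pandas as pd
import requests
from sqlalchemy import create_engine

# Placeholder connection string; the real app uses its own database.
engine = create_engine("sqlite:///players.db")

def etl_one_player(player_id: int) -> None:
    # Extract: pull one player's records from a hypothetical API endpoint.
    resp = requests.get(f"https://example.com/api/players/{player_id}/stats")
    resp.raise_for_status()
    df = pd.DataFrame(resp.json())

    # Transform: stand-in for the per-player cleanup the pipeline does.
    df = df.dropna(how="all")

    # Load: append this player's rows to the database.
    df.to_sql("player_stats", engine, if_exists="append", index=False)
```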
The problem is that this is memory-intensive, and I'm running into RAM issues when running it in a Docker container.
What would be a good, memory-efficient tool for moving datasets through an application?
To clarify: I'm using pandas as the tool that carries this dataset through the app, where I add data to it, transform it, and store it in the database. A replacement tool would need to do the same thing in Python. I also have only one server, so a distributed system like Spark isn't an option for me.