Wondering if anyone here has some advice or a good place to learn about dealing with databases with Python. I know SQL fairly well for pulling data and simple updates, but I'm running into potential performance issues with the way I've been doing it. Here are 2 examples.
-
Dealing with Pandas dataframes. I'm doing some reconciliation between a couple of different data sources. I do not have a primary key to work with, but I have some very specific criteria to determine a match. The matching process is all built within Python. Is there a good way to do the database commits with updates/inserts en masse vs. line by line? I've looked into upsert (an insert with a clause that updates the existing row on conflict), but pretty much all examples I've seen rely on primary keys (which I don't have, since I'm matching on 4 columns of data).
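For what it's worth, an upsert does not strictly need a primary key: in SQLite (3.24 or newer) and PostgreSQL, the conflict target can be any UNIQUE constraint, including one spanning several columns. Below is a minimal sketch using the standard-library sqlite3 module. The table `recon` and its column names are made up for illustration, and it assumes the four match columns are never NULL (UNIQUE treats NULLs as distinct). Rows from a dataframe could be produced with `df.itertuples(index=False, name=None)`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical table: the four match columns form a composite UNIQUE key.
conn.execute(
    """
    CREATE TABLE recon (
        vendor       TEXT NOT NULL,
        invoice_no   TEXT NOT NULL,
        invoice_date TEXT NOT NULL,
        currency     TEXT NOT NULL,
        amount       REAL,
        UNIQUE (vendor, invoice_no, invoice_date, currency)
    )
    """
)

# On a clash with the 4-column key, update the row instead of inserting.
UPSERT_SQL = """
    INSERT INTO recon (vendor, invoice_no, invoice_date, currency, amount)
    VALUES (?, ?, ?, ?, ?)
    ON CONFLICT (vendor, invoice_no, invoice_date, currency)
    DO UPDATE SET amount = excluded.amount
"""


def bulk_upsert(conn, rows):
    """Upsert all rows (5-tuples) in one transaction instead of row by row."""
    with conn:  # commits once at the end, rolls back if anything raises
        conn.executemany(UPSERT_SQL, rows)
```

Calling `bulk_upsert(conn, rows)` twice with the same keys leaves one row per key, carrying the most recent `amount`.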
-
Dealing with JSON files which have multiple layers of related data. My database is built in such a way that I have a table for header information, one for line-level detail, and a third level with specific references assigned to the line-level detail. As with a lot of transactional databases, there can be multiple references per line and multiple lines per header. I'm currently looping through the JSON file, starting with the header information to create the primary key, then going to the line-level detail to create a primary key for the line (including the foreign key for the header), and doing the same for the reference data. Before inserting, I do a lookup to see if the data already exists, then update if it does or insert a new record if it doesn't. This works fine, but it is slow, taking several seconds for maybe 100 inserts in total. That's not a big deal since it's for a low volume of sales, but I'd rather learn best practice and do this properly with commits/transactions, instead of inserting and updating each record individually, and with the ability to roll back should an error occur.
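To make the question concrete, here is a rough sketch of the header/line/reference load done as a single transaction with the standard-library sqlite3 module. Everything about the schema is assumed for illustration: the table names (`header`, `line`, `line_ref`), the natural keys (`order_no`, `line_no`, `ref`), and the JSON layout are hypothetical. Each level is upserted on its natural key, so no separate "does it exist" lookup is needed, and `with conn:` commits once at the end or rolls everything back if any step raises.

```python
import json
import sqlite3

# Hypothetical three-level schema with a natural UNIQUE key at each level.
SCHEMA = """
CREATE TABLE header (
    id INTEGER PRIMARY KEY,
    order_no TEXT NOT NULL UNIQUE,
    customer TEXT
);
CREATE TABLE line (
    id INTEGER PRIMARY KEY,
    header_id INTEGER NOT NULL REFERENCES header(id),
    line_no INTEGER NOT NULL,
    sku TEXT,
    UNIQUE (header_id, line_no)
);
CREATE TABLE line_ref (
    id INTEGER PRIMARY KEY,
    line_id INTEGER NOT NULL REFERENCES line(id),
    ref TEXT NOT NULL,
    UNIQUE (line_id, ref)
);
"""

# Made-up sample in the assumed JSON layout.
SAMPLE = """
[{"order_no": "A-1", "customer": "Ann",
  "lines": [{"line_no": 1, "sku": "X", "refs": ["r1", "r2"]},
            {"line_no": 2, "sku": "Y", "refs": ["r3"]}]}]
"""


def load_orders(conn, orders):
    """Upsert every header, line, and reference in one transaction."""
    with conn:  # one commit on success, rollback if anything raises
        for order in orders:
            conn.execute(
                "INSERT INTO header (order_no, customer) VALUES (?, ?) "
                "ON CONFLICT (order_no) DO UPDATE SET customer = excluded.customer",
                (order["order_no"], order["customer"]),
            )
            # Fetch the surrogate key by natural key (works for new and existing rows).
            header_id = conn.execute(
                "SELECT id FROM header WHERE order_no = ?", (order["order_no"],)
            ).fetchone()[0]
            for line in order["lines"]:
                conn.execute(
                    "INSERT INTO line (header_id, line_no, sku) VALUES (?, ?, ?) "
                    "ON CONFLICT (header_id, line_no) DO UPDATE SET sku = excluded.sku",
                    (header_id, line["line_no"], line["sku"]),
                )
                line_id = conn.execute(
                    "SELECT id FROM line WHERE header_id = ? AND line_no = ?",
                    (header_id, line["line_no"]),
                ).fetchone()[0]
                # References carry no extra data here, so duplicates are skipped.
                conn.executemany(
                    "INSERT OR IGNORE INTO line_ref (line_id, ref) VALUES (?, ?)",
                    [(line_id, ref) for ref in line["refs"]],
                )


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
load_orders(conn, json.loads(SAMPLE))
```

Running `load_orders` again on the same file should leave the row counts unchanged, and a malformed record partway through should leave the database as it was before the call.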
I typically like using sqlalchemy's ORM for my database operations.
For something simple, using it with sqlite3 can be more efficient than parsing through a JSON file. Combine this with a primary key to avoid the double-insertion problem (so you don't have to iterate through the data before inserting) and it can work out quite well.
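As a rough illustration of the primary-key idea, here it is in plain sqlite3 rather than the ORM, with a made-up `sale` table and `sale_id` key. With the key declared, `INSERT OR IGNORE` lets the database skip rows that already exist, so there is no lookup pass beforehand.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical table: sale_id is the natural primary key.
conn.execute("CREATE TABLE sale (sale_id TEXT PRIMARY KEY, total REAL)")


def insert_new_sales(conn, sales):
    """Insert (sale_id, total) pairs, silently skipping ids already stored."""
    with conn:  # single transaction for the whole batch
        conn.executemany(
            "INSERT OR IGNORE INTO sale (sale_id, total) VALUES (?, ?)", sales
        )
```

Note that this keeps the first version of each row; if existing rows should be updated instead, an `ON CONFLICT ... DO UPDATE` clause would be the thing to look at.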
I've never really used Pandas dataframes though.
Another fun option (if you're willing to use a disk cache rather than a database) is https://github.com/grantjenks/python-diskcache. Behind the scenes it actually also uses a sqlite3 db.
What function or class from that library would I use to do this? Or what keywords can I use to search and learn more? I'm struggling to wrap my head around it.
Here is a basic tutorial for how sqlalchemy works. If you already have a database in place you'll have to port your schema to it, but it might be worth trying out to see if it's more performant.