updating primary key of master and child tables for large tables
I have a fairly huge database with a master table with a single column GUID (custom GUID like algorithm) as primary key and 8 child tables that have foreign key relationships with this GUID column. All the tables have approximately 3-8 million records. None of these tables have any BLOB/CLOB/TEXT or any other fancy data types just normal numbers, varchars, dates, and timestamps (about 15-45 columns in each table). No partitions or other indexes other than the primary and foreign keys.
Now, the custom GUID algorithm has changed and though there are no collisions I would like to migrate all the old data to use GUIDs generated using the new algorithm. No other columns need to be changed. Number one priority is data integrity and performance is secondary.
Some of the possible solutions that I could think of were (as you will probably notice they all revolve around one idea only)
- add new column ngu_id and populate with new gu_id; disable constraints; update child tables with ngu_id as gu_id; renaname ngu_id->gu_id; re-enable constraints
- read one master record and its dependent child records from child tables; insert into the same table with new gu_id; remove all records with old gu_ids
- drop constraints; add a trigger to the master table such that all the child tables are updated; start updating old gu_id's with new new gu_ids; re-enable constraints
- add a trigger to the master table such that all the child tables are updated; start updating old gu_id's with new new gu_ids
- create new column ngu_ids on all master and child tables; create foreign key constraints on ngu_id columns; add update trigger to the master table to cascade values to child tables; insert new gu_id values into ngu_id column; remove old foreign key constraints based on gu_id; remove gu_id column and rename ngu_id to gu_id; recreate constraints if necessary;
- use on update cascade if available?
My questions are:
- Is there a better way? (Can't burrow my head in the sand, gotta do this)
- What is the most suitable way to do this? (I've to do this in Oracle, SQL server and mysql4 so, vendor-specific hacks are welcome)
- What are the typical points of failure for such an exercise and how to minimize them?
If you are with me so far, thank you and hope you can help :)
Your ideas should work. the first is probably the way I would use. Some cautions and things to think about when doing this: Do not do this unless you have a current backup. I would leave both values in the main table. That way if you ever have to figure out from some old paperwork which record you need to access, you can do it. Take the database down for maintenance while you do this and put it in single user mode. The very last thing you need while doing something like this is a user attempting to make changes while you are in midstream. Of course, the first action once in single user mode is the above-mentioned backup. You probably should schedule the downtime for some time when the usage is lightest. Test on dev first! This should also give you an idea as to how long you will need to close production for. Also, you can try several methods to see which is the fastest. Be sure to communicate in advance to users that the database will be going down at the scheduled time for maintenance and when they can expect to have it be available again. Make sure the timing is ok. It really makes people mad when they plan to stay late to run the quarterly reports and the database is not available and they didn't know it. There are a fairly large number of records, you might want to run the updates of the child tables in batches (one reason not to use cascading updates). This can be faster than trying to update 5 million records with one update. However, don't try to update one record at a time or you will still be here next year doing this task. Drop indexes on the GUID field in all the tables and recreate after you are done. This should improve the performance of the change.
Create a new table with the old and the new pk values in it. Place unique constraints on both columns to ensure you haven't broken anything so far.
Run an updates against all the tables to modify the old value to the new value.
Enable the PK, then enable the FK's.
It's difficult to say what the "best" or "most suitable" approach is as you have not described what you are looking for in a solution. For example, do the tables need to be available for query while you are migrating to new IDs? Do they need to be available for concurrent modification? Is it important to complete the migration as fast as possible? Is it important to minimize the space used for migration?
Having said that, I would prefer #1 over your other ideas, assuming they all met your requirements.
Anything that involves a trigger to update the child tables seems error-prone and over complicated and likely will not perform as well as #1.
Is it safe to assume that new IDs will never collide with old IDs? If not, solutions based on updating the IDs one at a time will have to worry about collisions -- this will get messy in a hurry.
Have you considered using CREATE TABLE AS SELECT (CTAS) to populate new tables with the new IDs? You'll be making a copy of your existing tables and this will require additional space, however it is likely to be faster than updating the existing tables in place. The idea is: (i) use CTAS to create new tables with new IDs in place of the old, (ii) create indexes and constraints as appropriate on the new tables, (iii) drop the old tables, (iv) rename the new tables to the old names.
In fact, it depend on your RDBMS.
Using Oracle, the simpliest choice is to make all of the foreign key constraints "deferred" (check on commit), perform updates in a single transaction, then commit.