Tuesday, 3 January 2012

Primary Keys in Entity Framework and Sync Framework

Hold on - we're going for a walk on the wild side! It will take a few minutes to read - but it's a thriller...
This is a summary of what I've discovered after a bit of research - hopefully it will help other developers/architects out.

As part of a project I'm looking at creating a SQL server replicated architecture... and need to avoid clashes in the PK. The icing on the cake is that we're using Entity Framework heavily in the project for data access.
Exhibit 1 ... http://msdn.microsoft.com/en-us/library/bb726011.aspx

This article gives a few options about choosing a suitable Primary key in replication scenarios:

1. Using a GUID
2. Pk that includes a node ID
3. Natural keys

OK.... starting from the bottom up...

  • Natural Keys won't work for our project. Full stop
  • A PK that includes a Node ID - means adding a column and giving each client  a Node ID. Not the end of the world, but... doesn't feel right...
  • GUID... sounds good, or is it?

Enter
Exhibit 2 - http://leedumond.com/blog/using-a-guid-as-an-entitykey-in-entity-framework-4/
Yes, it's all true... I tried it and had to hack the EF model XML to include the correct attribute to get it to work. And, Yes, doing an update loses the change.
Yes, it's a feature and yes, it's not pleasant - it would mean after every update hacking the XML to find all the relevant GUID PKs and adding in the attribute. I'm sure we could automate it by some clever searching and replacing, but it's not right.

Also, Enter
Exhibits 3, 4, 5 and 6
http://www.eggheadcafe.com/tutorials/sql-server/d662b371-ed27-481c-aee0-ebe7cf2d9fad/why-guids-are-not-a-good-idea-for-sql-server-primary-keys.aspx

http://databases.aspfaq.com/database/what-should-i-choose-for-my-primary-key.html

http://www.simple-talk.com/sql/database-administration/the-identity-crisis-in-replication/

http://sqlblogcasts.com/blogs/martinbell/archive/2009/05/25/GUID-Fragmentation-in-SQL-Server.aspx


From this set of links you start to see that GUIDs are good, but come with a number of caveats.

Also using newsequentialid() - doesn't necessarily generate sequential GUIDs - I tried this. If you do a number of inserts close together then you do get them, but not always. 
So, you then can potentially have the performance issues.
Also, the space issue is relevant for us - since we're looking to replicate to SQL server compact.

Alternatives?

First off,
well we could use a normal ID (int) column and change the seed value for each client.
E.g. The Server has a seed of 10,000000
Client 1 has a seed of 20, 000000,
Client 2 has a seed of 30, 000000 etc

So, each client would be able to make 9,999,999 inserts before a conflict and we'd have a limit of 214(X2) users because we could have negative numbers as well.
Lets just do some maths....
The max value of an int is +/-2,147,483647. 
If you're really worried about conflicts you could go for using a bigint as the ID column, then you could have something like 9223327036850 clients! The seeds could then be something like 100,000000, 200000000, 300000000, etc so 99,999,999 inserts before a conflict - which should be enough for a lifetime - e.g. one insert a second would take you 1157 days. 100 inserts a day would take over 2500 years.
If you were using the the smaller int then 100 inserts a day would last over 250 years!

That means changing the seed on each client appropriately.

Now, there are ways of doing this:
http://www.techrepublic.com/blog/datacenter/how-do-i-reseed-a-sql-server-identity-column/406
but it seems there a some gotchas that you have to be aware of:
http://geekswithblogs.net/argot/archive/2009/10/18/sync-framework-common-practise-of-ado.net.aspx


Basically, you have to store and remember seed values before/after synchronisation.

You'd keep a list of seed values for each client. So any time you get a  new client the first sync would also download its seed value.

This solution has a certain appeal, because:

a. it leaves the PK alone as a single ID column
b. The replication mechanism takes care of the seeding/reseeding.
c. Entity Framework designer won't throw any wobblies.

Next alternative... 
Go back to a compound PK, ID plus a Node ID.
a. Entity Framework should cope with this.
b. Replication mechanism should be a bit simpler.
But... it means infecting the database with the replication details i.e. the Node ID and artificially creating a more complex PK. So, you can't just go
select * from Users where ID = 10.

Another thought....


This is horrible I think, but has a certain "something" about it!
Leave the PKs alone completely - simple ID column with an int.
Then code for the PK conflicts in the sync framework - as it will detect PK conflicts. 
So, we write logic that will do something like:
- determine of the inserted record exists on the server 
- if not, then add it (generating a new record/PK) then remove it from the client and
- resync the client to pull down the "new" record but with the updated PK.

You'd have to write this sort of thing to cover various scenarios and it may/would be table-specific - but what if we replicating a lot of tables?


Summing up...


We're looking at  either the compound PK or different seed values.
We're going to try some stuff out - probably the different seed values first as that leaves the EF model alone completely - and hopefully I'll try and report back how it went.