A common method of producing random unique identifiers in SQL server is by using a GUID field, calling `newid()`

to generate the data. For the most part, this works because it’s 128 bits worth of random data, which means there is a very low probability of duplicate records for most databases.

However, it is also common to combine this with the `checksum()`

function to reduce it to a 32 bit integer. This makes collisions much more likely, even in relatively small databases. For example, the GUIDs `28258F69-6536-4198-BE37-94960ABF054F`

and `49B60D4B-DC4A-4E18-825E-B4C99713D011`

both checksum to `0xC3AD13D3`

. With a table of about 100,000 rows collisions will start to occur more frequently by the birthday paradox.

def P(x, n):
return 1 - ((x-1)/x)**(n * (n-1)/2)
>>> P(2**32,77500)
0.503022489601693
>>> P(2**32,200000)
0.9905011979711653
>>> P(2**53,10000000)
0.005535735601501335
>>> P(2**53,100000000)
0.42599174817345076 |

Using this maths, we can see that using a 32 bit random number, the probability of getting at least one collision is 50% at around 77,500 rows and 99% at 200,000 rows. We can also see that if we increase this to a 53 bit number, 10 million rows gives a 0.55% chance of getting at least collision and 100 million rows gives a 42.5% chance of getting at least one collision, so 64 bit should be plenty.

For higher precision numbers, we can use mpmath

from mpmath import *
mp.dps=50
def P(x, n):
return 1 - ((x-1)/x)**(n * (n-1)/2)
>>> print( P(mpf(2)**mpf(64), mpf(1000000000)) )
0.026741008873954667138505944308767166153477279783999
>>> print( P(mpf(2)**mpf(64), mpf(10000000000)) )
0.93349681454859684590636893705314709163709804436875 |

From here you can see that a 64 bit number has a 2.6% chance of getting a single collision in a 1 billion row table, and a 93% chance in a 10 billion row table.

A compromise of both is to simply truncate the GUID at 64 bits and optionally convert to a `bigint`

.

SELECT CONVERT(BIGINT, CONVERT(BINARY(8), newid())) |

If you leave it as binary and don’t need to convert to an integer type, this does not have to be 8 bytes. For example, you could have a 5 or a 10 byte code.

None of these are perfect but the probability of a collision decreases with more bits. If 128 bit is too long for you (e.g. to display to users) but 32 bit generates too many collisions, try a compromise such as 64 bit.

If you are consistent enough, you may even be able to store the original GUID and just display the truncated form, which could allow you to change the length displayed later without changing the probability of collisions. This is more flexible but may lead to confusion among users and consistency is required (differing lengths could lead to bugs).