|Given tables tracking users and their hobbies, how do we write a query that ranks pairs of users on hobby similarity? |
It's a SQL version of a famous computing problem known as nearest neighbour search or similarity search: given a metric space of K vectors, return the N most similar.
Start by defining a similarity measure for one pair of users: if user A has x hobbies and user B has y hobbies, one measure of their similarity is the number of hobbies they share, divided by the number of hobbies that either has. Is that plausible? If A and B both have hobbies 1,14,27, they are 100*3/3=100% similar. If A has 9,13 while B has 6,9,15, together they have 4 hobbies, one of which they share, so their similarity is 25%.
That's reasonable, but incomplete. If the comparison space has 100 hobbies, and one pair shares 1 of 4 while another pair shares 4 of 16, are those two pairs equally similar? Arguably not, since the second pair shares a greater proportion of the total possible. Then the similarity measure should take this into account, so if N=the total number of hobbies, S=the number of hobbies shared by a pair, and T=the total number of distinct hobbies they have together, their similarity is (S/T) * (T/N), or simply S/N.
So from first logical principles, the solution is a three-step:
Step 1: count hobbies, then collect hobbies pairwise by user:
Step 2: Count, calculate and order by the percentages:
But there is no hard and fast rule that determines a uniquely correct measure of pairwise similarity. Some circumstances may require the above formula. Others may require simple or complicated weights computed from pair and population sizes. Implement them by applying a corrective calculation to the denominator