Pairwise matchmaking

from the Artful Common Queries page


Given tables tracking users and their hobbies, how do we write a query that ranks pairs of users on hobby similarity?

DROP TABLE IF EXISTS users,hobbies,users_hobbies;
CREATE TABLE users( id int, name char(16) ) ;
INSERT INTO users VALUES (1,'John'),(2,'Lewis'),(3,'Muhammad');
CREATE TABLE hobbies( id int, title char(16) ) ;
INSERT INTO hobbies 
VALUES (1,'Sports'),(2,'Computing'),(3,'Drinking'),(4,'Racing'),(5,'Swimming'),(6,'Photography');
CREATE TABLE users_hobbies( user_id int, hobby_id int ) ;
INSERT INTO users_hobbies 
VALUES (1,2),(1,3),(1,6),(2,1),(2,5),(2,6),(3,2),(3,5),(3,6),(1,2),(1,3),(1,6),(2,1),
(2,5),(2,6),(3,2),(3,5),(3,6),(1,2),(1,3),(1,6),(2,1),(2,5),(2,6),(3,2),(3,5),(3,6);

It's a SQL version of a famous computing problem known as nearest neighbour search or similarity search: given a metric space of K vectors, return the N most similar.

Start by defining a similarity measure for one pair of users: if user A has x hobbies and user B has y hobbies, one measure of their similarity is the number of hobbies they share, divided by the number of hobbies that either has. Is that plausible? If A and B both have hobbies 1,14,27, they are 100*3/3=100% similar. If A has 9,13 while B has 6,9,15, together they have 4 hobbies, one of which they share, so their similarity is 25%.

That's reasonable, but incomplete. If the comparison space has 100 hobbies, and one pair shares 1 of 4 while another pair shares 4 of 16, are those two pairs equally similar? Arguably not, since the second pair shares a greater proportion of the total possible. Then the similarity measure should take this into account, so if N=the total number of hobbies, S=the number of hobbies shared by a pair, and T=the total number of distinct hobbies they have together, their similarity is (S/T) * (T/N), or simply S/N.

So from first logical principles, the solution is a three-step:
  1. Write the above logic in SQL
  2. Write a query that applies that SQL to all pairs
  3. Rank the scores that result
Fortunately, aggregation collapses these three logical steps to 2:

Step 1: count hobbies, then collect hobbies pairwise by user:

SET @N = (Select Count(DISTINCT id) FROM hobbies);
SELECT @N;
+------+
| @N   |
+------+
|    6 |
+------+
SELECT a.user_id, b.user_id, Group_Concat(DISTINCT a.hobby_id) AS 'Pairwise shared hobbies'
FROM users_hobbies a
JOIN users_hobbies b ON a.user_id<b.user_id AND a.hobby_id=b.hobby_id
GROUP BY a.user_id,b.user_id;
+---------+---------+-------------------------+
| user_id | user_id | Pairwise shared hobbies |
+---------+---------+-------------------------+
|       1 |       2 | 6                       |
|       1 |       3 | 2,6                     |
|       2 |       3 | 5,6                     |
+---------+---------+-------------------------+

Step 2: Count, calculate and order by the percentages:

SELECT 
  Concat(a.user_id, ' & ', b.user_id) AS Pair,
  Round( Count( DISTINCT a.hobby_id ) / @N, 2 ) AS Pct
FROM users_hobbies a
JOIN users_hobbies b ON a.user_id<b.user_id AND a.hobby_id=b.hobby_id
GROUP BY a.user_id,b.user_id 
ORDER BY Pct DESC ;
+-------+------+
| Pair  | Pct  |
+-------+------+
| 1 & 3 | 0.33 |
| 2 & 3 | 0.33 |
| 1 & 2 | 0.17 |
+-------+------+

But there is no hard and fast rule that determines a uniquely correct measure of pairwise similarity. Some circumstances may require the above formula. Others may require simple or complicated weights computed from pair and population sizes. Implement them by applying a corrective calculation to the denominator @N in Count( DISTINCT a.hobby_id ) / @N, 2 ).

Return to the Artful Common Queries page