Avoiding repeat aggregation

from the Artful Common Queries page


In a good introductory tutorial on MySQL subqueries, Jeremy Cole developed a triply nested query to retrieve the name, population, head of state, and number of official languages in countries with the most official languages. The query uses two tables from the MySQL world database:

CREATE TABLE country (
  Code char(3) NOT NULL DEFAULT '' PRIMARY KEY,
  Name char(52) NOT NULL DEFAULT '',
  Population int(11) NOT NULL DEFAULT '0',
  HeadOfState char(60) DEFAULT NULL,
  ... other columns ...
);
CREATE TABLE countrylanguage (
  CountryCode char(3) NOT NULL DEFAULT '' PRIMARY KEY,
  Language char(30) NOT NULL DEFAULT '',
  IsOfficial enum('T','F') NOT NULL DEFAULT 'F',
  Percentage float(4,1) NOT NULL DEFAULT '0.0'
);

The query needs to aggregate language counts by country twice: once for all language counts by country, and once again to identify countries with the highest number of languages:

SELECT name, population, headofstate, top.num
FROM Country
JOIN ( 
  SELECT countrycode, COUNT(*) AS num
  FROM CountryLanguage
  WHERE isofficial='T'
  GROUP BY countrycode
  HAVING num = (
    SELECT MAX(summary.nr_official_languages)
    FROM  (
      SELECT countrycode, COUNT(*) AS nr_official_languages
      FROM CountryLanguage 
      WHERE isofficial='T' 
      GROUP BY countrycode
    ) AS summary
  )
) as top ON Country.code=top.countrycode;
+--------------+------------+-------------+-----+
| name         | population | headofstate | num |
+--------------+------------+-------------+-----+
| Switzerland  |    7160400 | Adolf Ogi   |   4 |
| South Africa |   40377000 | Thabo Mbeki |   4 |
+--------------+------------+-------------+-----+

In addition, one of the nested subqueries is buried in a HAVING clause. This is fine with small tables, but if the table being aggregated is very large and the aggregation is complex, performance may be unsatisfactory. Substituting a temporary table for the double nesting can improve performance in two ways:
  • the aggregation needs to be done just once
  • we can use an exclusion join, which is usually faster than a HAVING clause, to find countries with the maximum counts:

DROP TABLE IF EXISTS top;
CREATE TABLE top ENGINE=MEMORY
  SELECT countrycode, COUNT(*) AS num
  FROM CountryLanguage l1
  WHERE isofficial='T'
  GROUP BY countrycode;

SELECT name,population,headofstate,t3.num
FROM country c
JOIN (
  SELECT t1.countrycode, t1.num
  FROM top t1
  LEFT JOIN top t2 ON t1.num < t2.num
  WHERE t2.countrycode IS NULL
) AS t3 ON c.code=t3.countrycode;
+--------------+------------+-------------+-----+
| name         | population | headofstate | num |
+--------------+------------+-------------+-----+
| Switzerland  |    7160400 | Adolf Ogi   |   4 |
| South Africa |   40377000 | Thabo Mbeki |   4 |
+--------------+------------+-------------+-----+
DROP TABLE top;

You notice that we haven't actually used a TEMPORARY table? Indeed we haven't, because of the MySQL limitation that temporary tables cannot be referenced multiple times in a query. Until that's lifted, we get almost as much speed improvement from using a MEMORY table as a temporary table.

Return to the Artful Common Queries page