Track state changes

from the Artful Common Queries page


You have a table containing a sequence of IDs, states and dates. The query requirement is to list state changes and their dates ordered by date, and count instances of each sequenced state. This data from a blog entry by a Russian MySQLer calling himself Quassnoi ...

+----+-------+------------------+
|ID  | State |  Datetime        |
+----+-------+------------------+
| 12 |   1   | 2009-07-16 10:00 |
| 45 |   2   | 2009-07-16 13:00 |
| 67 |   2   | 2009-07-16 14:40 |
| 77 |   1   | 2009-07-16 15:00 |
| 89 |   1   | 2009-07-16 15:30 |
| 99 |   1   | 2009-07-16 16:00 |
+----+-------+------------------+

should produce this result ...

+----+-------+------------------+-------+
|ID  | State |  Datetime        | Count |
+----+-------+------------------+-------+
| 12 |   1   | 2009-07-16 10:00 |   1   |
| 45 |   2   | 2009-07-16 13:00 |   2   |
| 77 |   1   | 2009-07-16 15:00 |   3   |
+----+-------+------------------+-------+

To test query efficiency, we need more than six rows. To generate sequential test data, a utility table of ints from 0 through 9 is helpful. We keep general-purpose database objects in a sys database available to all developers on the server:

create database if not exists sys;
use sys;
drop table if exists ints;
create table ints(i int);  /* See "Make a table of sequential ints" */
insert into ints values(0),(1),(2),(3),(4),(5),(6),(7),(8),(9); 

Quassnoi's test data was 10,000 rows tracking ID, state, date and value, with state oscillating between 1 and 2 every one thousand rows. The primary key is ID. There is a covering (date, state) index. Dates are in descending order. To simplify comparative testing, we parameterise the state-change interval:

set @state_interval=1000;
drop table if exists history;
create table history (
  id int primary key,
  state int not null,
  date datetime not null,
  value varchar(16) not null,
  key(date, state)
) engine=innodb default charset=utf8;
insert into history
 select
   id,
   1 + floor(((id-1)/@state_interval) mod 2),
   '2009-07-24' - interval id minute,
   concat('value ', id)
from (
  select 1 + ( t.i*1000 + u.i*100 + v.i*10 + w.i ) as id
  from sys.ints t
  join sys.ints u
  join sys.ints v 
  join sys.ints w
) tmp;


To the original requirement we add reporting the value associated with each state change. Quassnoi found a set-based solution, but it is painfully slow:

SELECT MIN(id) AS id, MIN(date) AS date, MIN(state) AS state, value, COUNT(*) cnt
FROM (
  SELECT  
    id, date, state,value,
    ( SELECT id
      FROM history a
      WHERE (a.date, a.state, a.id) < (b.date, b.state, b.id) AND a.state <> b.state
      ORDER BY date DESC, state DESC
      LIMIT 1
    ) AS prev
  FROM history b
) q
GROUP BY prev
ORDER BY date;
+------+---------------------+-------+-------------+------+
| id   | date                | state | value       | cnt  |
+------+---------------------+-------+-------------+------+
| 9001 | 2009-07-17 01:20:00 |     2 | value 10000 | 1000 |
| 8001 | 2009-07-17 18:00:00 |     1 | value 9000  | 1000 |
| 7001 | 2009-07-18 10:40:00 |     2 | value 8000  | 1000 |
| 6001 | 2009-07-19 03:20:00 |     1 | value 7000  | 1000 |
| 5001 | 2009-07-19 20:00:00 |     2 | value 6000  | 1000 |
| 4001 | 2009-07-20 12:40:00 |     1 | value 5000  | 1000 |
| 3001 | 2009-07-21 05:20:00 |     2 | value 4000  | 1000 |
| 2001 | 2009-07-21 22:00:00 |     1 | value 3000  | 1000 |
| 1001 | 2009-07-22 14:40:00 |     2 | value 2000  | 1000 |
|    1 | 2009-07-23 07:20:00 |     1 | value 1000  | 1000 |
+------+---------------------+-------+-------------+------+

The correlated SELECT ... ORDER BY ... LIMIT 1 subquery may look dodgy, but a conventional call to MIN() is even slower:

SELECT MIN(id) AS id, MIN(date) AS date, MIN(state) AS state, COUNT(*) cnt
FROM (
  SELECT  
    id, date, state,
    ( SELECT min(id)
      FROM history a
      WHERE (a.date, a.state, a.id) < (b.date, b.state, b.id) AND a.state <> b.state
    ) AS prev
  FROM history b
) tmp
GROUP BY prev
ORDER BY date;

The solution is to build a derived table ordered by date and state with user variables to track state changes. The derived column state_change starts at 0 and increments whenever state changes. Then selecting minimum date and state values from the derived table, and grouping them by state_change, picks out the rows where state changes. This is hundreds of times faster than either of the set-based solutions:

SELECT MIN(id) AS id, MIN(date) AS date, MIN(state) AS state, value, COUNT(*) cnt
FROM (
  SELECT 
    @r := @r + (@state != state OR @state IS NULL) AS state_change,
    @state := state AS current,
    h.id, h.date, h.state, h.value
  FROM (
    SELECT @r := 0, @state := NULL ) as vars  -- one-row virtual table
    CROSS JOIN history as h                   -- cross-joined with history
    ORDER BY date, state                      -- ordered to track state changes
  ) q
GROUP BY state_change
ORDER BY date;

If how this works puzzles you, run the query with state_change added to the query's SELECT list. In other SQL dialects, in fact, you would have to include the state_change column, since standard SQL requires that GROUP BY columns be SELECTed.

Return to the Artful Common Queries page