Treewalks with CTEs

from the Artful Common Queries page


(If you're unfamiliar with SQL processing of trees, consider spending a little time with our chapter on trees and other hierarchies in MySQL (it's free), before tackling this piece. If you're new to CTEs, you might want to read the MySQL manual page on them.)

Given a table infotree( id, parentid, name ) containing one or more trees---i.e., one or more connected acyclic graphs)---how do we use CTEs, available in MySQL since version 8.0 and MariaDB since version 10.2.2, to walk the subtree of a given node?

It's a standard CTE WITH loop. There are just two tricky bits.

First, MySQL is fussy about CTE string columns overflowing with data; the first occurrence of such a column needs to declare the string long enough to accommodate the longest value the tree can fill it with.

And, to order the result in a way that shows tree layout, include a tree path result column that orders the result in treewalk order:

set @root=5;                           -- subtree root value
with recursive treewalk as (
  select id, 0 as level, cast( id as char(64) ) as path, name
  from infotree
  where id=@root                       -- query for subtree root 
  union
  select                               -- query for nodes
    t.id, tw.level+1 as level,
    concat( path, ',', t.id ) as path, -- path down to this node
    t.name
  from infotree t
  join treewalk tw on t.parentid=tw.id
)
select * from treewalk order by path;

Do CTEs also make sums down subtrees easier? Yew bet they do. Consider this toy table ...

drop table if exists t;
create table t (
  id int primary key,
  parentid int,
  name varchar(10),
  sales int
);
insert into t values
(1, null, 'Vehicles', 0),
(2, 1,    'Cars',     0),
(3, 1,    'Bikes',    0),
(4, 2,    'Ford',    10),
(5, 4,    'Mustang',  7),
(6, 4,    'Focus',    4),
(7, 3,    'Electra',  5),
(8, 3,    'Trek',     8);

To get our result, walk subtrees in the tree table and join that result with GROUP BY sums ...

with recursive cte as (
  select t.id, t.name, t.sales, t.id as rootid -- seed row
  from t 
  union all
  select t.id, t.name, t.sales, cte.rootid     -- repeat to every leaf edge
  from t
  join cte on t.parentid = cte.id          
)
select 
  rootid, name, sales, 
  sum(sales) as subtree_sales                  -- aggregate
from cte
group by rootid
order by rootid;
+--------+----------+-------+---------------+
| rootid | name     | sales | subtree_sales |
+--------+----------+-------+---------------+
|      1 | Vehicles |     0 |            34 |
|      2 | Cars     |     0 |            21 |
|      3 | Bikes    |     0 |            13 |
|      4 | Ford     |    10 |            21 |
|      5 | Mustang  |     7 |             7 |
|      6 | Focus    |     4 |             4 |
|      7 | Electra  |     5 |             5 |
|      8 | Trek     |     8 |             8 |
+--------+----------+-------+---------------+

Yes these are simple examples with tiny rowcounts. But with the basic concepts in hand, more demanding queries are possible.

Suppose reads of items listed in the infotree table used in the first example above are tracked in visitlog( id int unsigned primary key, infoid int unsigned, dt timestamp ), and you wish to see read counts summed down all subtrees starting from a specified root.

We'll need two treewalks---one to populate a tree of all item descendants of the specified root item, a second to walk down all subtrees of the first CTE result, summing as we go.

Our aim is one query incorporating all required CTEs because that'll perform better, but if you're new to CTEs it can be a challenge to follow the cascading references, so first we just save the first CTE query result to an intermediate table that the second second CTE can run against.

Suppose we want the subtree starting at infotree node id=5. First build the tree, saving the result in a table ...

set @root=5;
create table tmp as
  with recursive treewalk as ( 
    select 
      id, parentid, 0 as level, cast( id as char(64) ) as path, 
      upper(name) as name
    from infotree 
    where id=@root                       -- row for subtree root  
    union 
    select                               -- query the nodes 
      t.id, t.parentid, tw.level+1 as level, 
      concat( path, ',', t.id ) as path, -- save path down to this node 
      concat( repeat( '.', level+1 ),    -- indent by path depth
              if( c.id is not null, upper(t.name), t.name )
            ) as name
    from infotree t 
    left join infotree c on t.id=c.parentid
    join treewalk tw on t.parentid=tw.id 
  ) 
  select                                 -- fetch rows
    t.id, t.parentid, t.level, t.path, t.name, v.N as Total
  from treewalk t
  join(                                  -- join to aggregates
    select infoid, count(*) as N
    from visitlog_all
    group by infoid
  ) v on t.id=v.infoid
  order by t.path;                       -- order list by item path

Now to this result apply the logic of subtree sums we used above in the toy sales table example:

with recursive cte as (
  select tmp.id, tmp.id as rootid, tmp.Total
  from tmp                               -- seed rows
  union all
  select tmp.id, cte.rootid, tmp.Total   -- walk all subtrees
  from tmp
  join cte on tmp.parentid = cte.id          
)
select                                   -- fetch subtree walk rows
  tmp.id, tmp.name, 
  tmp.Total, sums.SubTreeTotal
from tmp
join (                                   -- join them to the sums
  select rootid, sum(Total) as SubTreeTotal
  from cte
  group by rootid
) sums on sums.rootid=tmp.id
order by tmp.path ;

Does it perform? 3 secs on an old server to walk and sum 2K tree items against 2M visit rows, even with the write to a tmp table.

Now put it all into one query with three CTEs named treewalk, rowsums and treesums. Remember, with MySQL CTEs the RECURSIVE keyword is required if any of the WITH subqueries is recursive:

set @root=5;
with recursive 
  treewalk as (  
    select  
      id, parentid, 0 as level, cast( id as char(64) ) as path,  
      upper(name) as name 
    from infotree  
    where id=@root                       -- row for subtree root   
    union  
    select                               -- query the nodes  
      t.id, t.parentid, tw.level+1 as level,  
      concat( path, ',', t.id ) as path, -- save path down to this node  
      concat( repeat( '.', level+1 ),    -- indent by path depth 
              if( c.id is not null, upper(t.name), t.name ) 
            ) as name 
    from infotree t  
    left join infotree c on t.id=c.parentid 
    join treewalk tw on t.parentid=tw.id  
  ),
  rowsums as (
    select                               -- fetch rows 
      t.id, t.parentid, t.level, t.path, t.name, v.N as Total 
    from treewalk t 
    join(                                -- join to aggregates 
      select infoid, count(*) as N 
      from visitlog_all 
      group by infoid 
    ) v on t.id=v.infoid 
    order by t.path                      -- order list by item path 
  ),
  treesums as ( 
    select                               -- seed rows
      rowsums.id, rowsums.id as rootid, 
      rowsums.Total 
    from rowsums  
    union all 
    select 
      rowsums.id, treesums.rootid, 
      rowsums.Total                      -- walk all subtrees 
    from rowsums 
    join treesums on rowsums.parentid = treesums.id           
  ) 
select                                   -- fetch rowsums 
  rowsums.id, rowsums.name,  
  rowsums.Total, sums.SubTreeTotal 
from rowsums 
join (                                   -- join them to subtree sums 
  select rootid, sum(Total) as SubTreeTotal 
  from treesums 
  group by rootid 
) sums on sums.rootid=rowsums.id 
order by rowsums.path ;                  -- tree path order

It returns its result in 1.4 secs against 2K infotree rows and 2M visit rows.

It would be useful as a parameterised View, but unfortunately MySQL doesn't (yet?) support parameterised Views. Then make it an sproc with a param for the subtree root ...

drop procedure if exists treereads;
delimiter go
create procedure treereads( root int unsigned )
begin
  with recursive  
    treewalk as (   
      select   
        id, parentid, 0 as level, cast( id as char(64) ) as path,   
        upper(name) as name  
      from infotree   
      where id=root                        -- row for subtree root    
      union   
      select                               -- query the nodes   
        t.id, t.parentid, tw.level+1 as level,   
        concat( path, ',', t.id ) as path, -- save path down to this node   
        concat( repeat( '.', level+1 ),    -- indent by path depth  
                if( c.id is not null, upper(t.name), t.name )  
              ) as name  
      from infotree t   
      left join infotree c on t.id=c.parentid  
      join treewalk tw on t.parentid=tw.id   
    ), 
    rowsums as ( 
      select                             -- fetch rows  
        t.id, t.parentid, t.level, t.path, t.name, v.N as Total  
      from treewalk t  
      join(                              -- join to aggregates  
        select infoid, count(*) as N  
        from visitlog_all  
        group by infoid  
      ) v on t.id=v.infoid  
      order by t.path                    -- order list by item path  
    ), 
    treesums as (  
      select rowsums.id, rowsums.id as rootid, rowsums.Total  
      from rowsums                       -- seed rows  
      union all  
      select  
        rowsums.id, treesums.rootid,  
        rowsums.Total                    -- walk all subtrees  
      from rowsums  
      join treesums on rowsums.parentid = treesums.id            
    )  
  select                                 -- fetch rowsums  
    rowsums.id, rowsums.name,   
    rowsums.Total, sums.SubTreeTotal  
  from rowsums  
  join (                                 -- join them to subtree sums  
    select rootid, sum(Total) as SubTreeTotal  
    from treesums  
    group by rootid  
  ) sums on sums.rootid=rowsums.id  
  order by rowsums.path ;                -- tree path order 
end;
go
delimiter ;


Return to the Artful Common Queries page