MySQL Paging Sorting with Data Duplication Problem (MySQL Priority Queue)

MySQL Version: 5.7.18

Problem

Assuming that the field category is not indexed and has duplicate values, the combination of order by category and limit will not work as expected.

Scene Reproduction

Table structure (reproduce the problem, two fields are enough ~)

`1`	`Table structure (reproduce the problem, two fields are enough ~)`

Sort all data by the category field: select * from ratings order by category ;

id	category
1	1
5	1
10	1
3	2
4	2
6	2
9	2
2	3
7	3
8	3

When we want to paginate the first 5 items use select * from ratings order by category limit 5 ;

The expected order of ID is 1 5 10 3 4.

But the actual result is as follows.

id	category
1	1
10	1
5	1
3	2
4	2

Maybe some friends have encountered this problem, through the search engine to solve a little, have you ever thought, you find the solution is the optimal solution? How did others arrive at this solution? MySQL Why would they do this, is it related to the version?

First throw the conclusion.

The optimal solution is to add a sort field with a unique column value after it, such as: order by category,id
Why does MySQL do this? The answer is for speed! (This optimization is available in MySQL 5.6 and later)
The suboptimal solution is to index the category after order by (why is it suboptimal?). You will have the answer after reading this article)

Finding the optimal solution

The scenario is described in the MySQL documentation 8.2.1.19 LIMIT Query Optimization as follows.

If multiple rows have identical values in the ORDER BY columns, the server is free to return those rows in any order, and may do so differently depending on the overall execution plan. In other words, the sort order of those rows is nondeterministic with respect to the nonordered columns.

One factor that affects the execution plan is LIMIT , so an ORDER BY query with and without LIMIT may return rows in different orders.

To summarize, when there are duplicate field values in the ORDER BY column, the order of the data returned by the ORDER BY statement will be different due to the presence of LIMIT.

This is the default MySQL optimization for this scenario, but if you need to ensure that the order is consistent with or without LIMIT, there is an official way to do this.

If it is important to ensure the same row order with and without LIMIT , include additional columns in the ORDER BY clause to make the order deterministic.

This is where an additional sort field (such as an ID field) is added after ORDER BY.

The above description first appeared in the MySQL 5.6 documentation, and this optimization for ORDER BY LIMIT has been introduced since this version.

Well, for the scenario in the article, we just need select * from ratings order by category,id; to solve it.

So the question is, why did MySQL make such a seemingly buggy optimization?

MySQL’s ORDER BY Logic

ORDER BY is the sort.

Do an explain select * from ratings order by category limit 5 ;

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: ratings
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 10
     filtered: 100.00
        Extra: Using filesort
1 row in set, 1 warning (0.00 sec)

You can see that Extra: Using filesort indicates that sorting is required.

Under normal circumstances, MySQL will have both in-memory and external sorting.

If the amount of data to be sorted is less than the sort buffer size, the sort is done in memory ( quick sort)

If the amount of data to be sorted is larger than the sort buffer size, the sort is done externally using a temporary file ( combine sort)

Obviously, these two kinds of sorting are to sort all the results, and reasonably, regardless of whether there is LIMIT or not, the number of items needed is taken from the sorted results in order, and whether there is LIMIT or not will not affect the order of the returned results.

However, MySQL 5.6 has a small optimization for ORDER BY LIMIT (when the sorted field is not indexed and the column value is not unique): the optimizer uses priority queue when it encounters an ORDER BY LIMIT statement.

The optimization is described by the following pseudo-code in filesort.cc.

while (get_next_sortkey())
   {
     if (using priority queue)
       push sort key into queue
     else
     {
       try to put sort key into buffer;
       if (no free space in sort buffer)
       {
         do {
           allocate new, larger buffer;
           retry putting sort key into buffer;
         } until (record fits or no space for new buffer)
         if (no space for new buffer)
         {
           sort record pointers (all buffers);
           dump sorted sequence to 'tempfile';
           dump Merge_chunk describing sequence location into 'chunk_file';
         }
       }
       if (key was packed)
         tell sort buffer the actual number of bytes used;
     }
   }
   if (buffer has some elements && dumped at least once)
     sort-dump-dump as above;
   else
     don't sort, leave sort buffer to be sorted by caller.

The optimization logic is described in WL#1393: Optimizing filesort with small limit:

Many web customers have to do
"SELECT ... ORDER BY non_index_column LIMIT X",

When X *  is smaller than sort_buff_size we can use
the following algoritm to speed up the sort:

- Create a queue to hold 'limit' keys.
- Scan through the table and store the first (last if DESC) keys in the queue
- Return values from queue

This is much faster than the current algoritm that works as:

The result of the optimization is documented in the WorkLog: 10 to 20 times faster than a quicksort (for those interested, go read the original article).

So, it’s all about speed!

MySQL thinks this scenario is a TOP N problem, which can be solved by using priority queue.

priority queue

priority queue is actually a heap, Java has java.util.PriorityQueue class, the essence of which is heap this data structure.

A brief explanation of what a heap is.

A heap is a complete binary tree.

Each node in the heap must have a value greater than or equal to (big top heap) or less than or equal to (little top heap) the value of each node in its subtree.

If MySQL uses merge or fast-ranking, you need to sort all the data and then take the first few entries of LIMIT, and the remaining sorted data will be wasted.

If you use priority queue, you can maintain a heap based on the number of LIMIT items, and you only need to go through all the data in this heap to get the result.

You can verify that MySQL is using priority queue by using the following statement.

1
2
3

SET optimizer_trace='enabled=on';
select * from ratings order by category limit 5;
SELECT * FROM `information_schema`.`OPTIMIZER_TRACE`\G;

 "filesort_priority_queue_optimization": {
              "limit": 5,
              "chosen": true
            },

You can see that filesort_priority_queue_optimization.chosen = true.

Here is a flowchart to restore the execution logic of the priority queue (using LIMIT 5 as an example):

tip: the small top heap in the diagram is sorted by the size of the category value

Take the first five data items to form a small top stack.
Take the next row of data (6,2) and find that 2 is smaller than the largest category 3 in the current heap, so delete (2,3) from the heap and add (6,2) to the heap.
Repeat step 2 until all the data that meet the query criteria have been compared into the heap, and the final data in the heap is as shown in the figure.

The above is the process of finding the smallest 5 rows of category data by priority queue.

Finally, we can get the result by taking it out of the heap and putting the last element into the top of the heap each time after the smallest element is taken out, and re-stacking it according to the small top heap.

As you can see, this result is consistent with the output of select * from ratings order by category limit 5;

Why indexing is a suboptimal solution

Obviously, according to the logic of ORDER BY, adding indexes to sorted fields directly can also eliminate the in-memory sorting step, thus solving the problem.

But indexing is not a silver bullet, and the extra category index will increase the maintenance cost of the table. If there is no obvious business need to add indexes simply to bypass the optimization of this priority queue, it is not worth the loss.

Especially when the table data volume is very large, the volume of indexes can be substantial. Moreover, for the scenario in the article, category as a category field, the duplication rate will be relatively high, even if there is a business SQL query by category, MySQL will not necessarily pick this index.

To sum up, I think order by category,id is the optimal solution for this scenario.

Table of Contents