MySQL's golden rule: "don't use select *"

2022-05-15 07:42:56 Java architecture design

"Don't use SELECT *" has almost become a golden rule of using MySQL. Even the Alibaba Java Development Manual explicitly states that * must not be used as the field list of a query, which gives the rule the blessing of authority.

But I still use SELECT * directly quite often during development, for two reasons:

  1. It is simple and makes development very efficient, and if fields are frequently added or modified later, the SQL statements don't need to change;
  2. I think premature optimization is a bad habit. Unless you can tell from the start which fields you will actually need in the end and build appropriate indexes for them, I prefer to optimize the SQL when I actually run into trouble, provided of course that the trouble isn't fatal.

Still, we should know why using SELECT * directly is discouraged. This article gives the reasons from four angles.

1. Unnecessary disk I/O

We know that MySQL essentially stores user records on disk, so a query is a form of disk I/O (assuming the records being queried are not already cached in memory).

The more fields you query, the more content has to be read, and so the higher the disk I/O cost. The effect is especially noticeable when some of the fields are of types such as TEXT, MEDIUMTEXT, or BLOB.

Does using SELECT * make MySQL take up more memory?

In theory, no. The server layer does not hold the complete result set in memory and then send it to the client all at once. Instead, each row obtained from the storage engine is written into a memory area called net_buffer, whose size is controlled by the system variable net_buffer_length and defaults to 16KB. When net_buffer is full, its contents are written to the socket send buffer of the local network stack and sent to the client. Once the send succeeds (the client has finished reading), net_buffer is emptied, and the server goes on to read and write the next row.

In other words, by default the result set occupies at most net_buffer_length of memory; a few extra fields won't take up additional memory.
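The buffering behaviour described above can be modelled in a few lines of Python. This is only a simplified sketch, not MySQL's actual implementation: the function name stream_rows is made up for illustration, rows are plain byte strings, and a row larger than the buffer is simply sent on its own.

```python
NET_BUFFER_LENGTH = 16 * 1024  # default net_buffer_length (16KB)


def stream_rows(rows, capacity=NET_BUFFER_LENGTH):
    """Stream encoded rows through a fixed-size buffer.

    Returns (flush_count, peak_buffered_bytes). Simplified model: a row
    larger than the buffer is buffered and flushed on its own.
    """
    buffered = 0
    peak = 0
    flushes = 0
    for row in rows:
        size = len(row)
        if buffered and buffered + size > capacity:
            flushes += 1      # buffer full: hand its contents to the socket
            buffered = 0      # then empty it
        buffered += size
        peak = max(peak, buffered)
    if buffered:
        flushes += 1          # send whatever is left at the end
    return flushes, peak


# 10,000 rows of 100 bytes each: about 1MB of result data in total.
rows = (b"x" * 100 for _ in range(10_000))
flushes, peak = stream_rows(rows)
print(flushes, peak)
```

However large the result set is, the peak amount buffered in this model never exceeds the buffer capacity; only the number of flushes grows.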

2. Increased network latency

Following on from the previous point: although the data in the socket send buffer is sent to the client piece by piece and each piece looks small, someone will inevitably use * to fetch TEXT, MEDIUMTEXT, or BLOB fields as well. The total amount of data becomes large, which directly leads to more network transmissions.

If MySQL and the application are not on the same machine, this overhead is very noticeable. Even if the MySQL server and the client are on the same machine, the connection typically still goes over TCP, and the communication still takes extra time.
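To get a feel for the numbers, here is a back-of-the-envelope estimate in Python. The table shape (10,000 rows, 60 bytes of narrow columns, a 4KB TEXT column) and the helper estimate_flushes are assumptions made up for this example, and protocol overhead is ignored.

```python
import math

NET_BUFFER_LENGTH = 16 * 1024  # default net_buffer_length (16KB)


def estimate_flushes(row_count, bytes_per_row, capacity=NET_BUFFER_LENGTH):
    """Rough lower bound on how many times the buffer is flushed to the socket."""
    return math.ceil(row_count * bytes_per_row / capacity)


# Hypothetical table: 10,000 rows, 60 bytes of narrow columns per row,
# plus a 4KB TEXT column that SELECT * would drag along.
narrow = estimate_flushes(10_000, 60)
wide = estimate_flushes(10_000, 60 + 4096)
print(narrow, wide)
```

Under these assumptions, the wide query needs dozens of times more buffer flushes than the narrow one for the same number of rows.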

3. Covering indexes cannot be used

To illustrate this point, we need to create a table:

CREATE TABLE `user_innodb` (
  `id` int NOT NULL AUTO_INCREMENT, -- assumed definition; the id column was missing from the listing
  `name` varchar(255) DEFAULT NULL,
  `gender` tinyint(1) DEFAULT NULL,
  `phone` varchar(11) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `IDX_NAME_PHONE` (`name`,`phone`) USING BTREE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

We created a table user_innodb with the InnoDB storage engine, set id as the primary key, created a composite index on name and phone, and finally populated the table with more than 5 million rows of random data.

InnoDB automatically creates a B+ tree for the primary key id, called the primary key index (also known as the clustered index). The most important feature of this B+ tree is that its leaf nodes contain the complete user records.

If we execute this statement

SELECT * FROM user_innodb WHERE name = 'Cicadas bathe in the wind';

Checking the statement's execution plan with EXPLAIN, we find that this SQL statement uses the IDX_NAME_PHONE index, which is a secondary index. Its leaf nodes store only the indexed columns and the primary key.

Following the search condition, the InnoDB storage engine finds the records whose name is 'Cicadas bathe in the wind' in the leaf nodes of the secondary index. But the secondary index only records the name, phone, and primary key id fields, while we asked for every column (who told us to use SELECT *?). So InnoDB has to take the primary key id to the primary key index to find the complete record. This process is called going back to the table.

Think about it: if the leaf nodes of the secondary index already contain all the data we want, there is no need to go back to the table, right? Yes, and that is a covering index.

For example, suppose we happen to want only the name, phone, and primary key fields.

SELECT id, name, phone FROM user_innodb WHERE name = 'Cicadas bathe in the wind';

Checking the execution plan with EXPLAIN, the Extra column shows Using index, which means the query list and search condition contain only columns belonging to one index. In other words, a covering index is used, the lookup back to the table is skipped entirely, and query efficiency improves greatly.
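The difference between the two queries can be illustrated with a toy model in Python. This is a sketch under loud assumptions: the sample rows are invented, the clustered index is a plain dict, the secondary index is a sorted list scanned linearly rather than a real B+ tree, and select_by_name is a made-up helper.

```python
# Clustered (primary key) index: id -> complete row.
clustered = {
    1: {"id": 1, "name": "alice", "gender": 0, "phone": "13800000001"},
    2: {"id": 2, "name": "bob", "gender": 1, "phone": "13800000002"},
    3: {"id": 3, "name": "alice", "gender": 1, "phone": "13800000003"},
}

# Secondary index on (name, phone): each entry also carries the primary key.
secondary = sorted((r["name"], r["phone"], r["id"]) for r in clustered.values())

INDEX_COLUMNS = {"id", "name", "phone"}


def select_by_name(name, columns):
    """Return (rows, clustered_lookups) for a query with WHERE name = ?."""
    rows, lookups = [], 0
    for idx_name, idx_phone, idx_id in secondary:
        if idx_name != name:
            continue
        entry = {"id": idx_id, "name": idx_name, "phone": idx_phone}
        if not set(columns) <= INDEX_COLUMNS:
            entry = clustered[idx_id]   # go back to the table for the full row
            lookups += 1
        rows.append({c: entry[c] for c in columns})
    return rows, lookups


all_columns = ["id", "name", "gender", "phone"]   # what SELECT * expands to
_, star_lookups = select_by_name("alice", all_columns)
_, covered_lookups = select_by_name("alice", ["id", "name", "phone"])
print(star_lookups, covered_lookups)
```

In this model, the query that asks for every column pays one clustered-index lookup per matching row, while the query limited to indexed columns pays none.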

4. May slow down JOIN queries

We create two tables, t1 and t2, and join them to illustrate the next problem, inserting 100 rows into t1 and 1000 rows into t2.

CREATE TABLE `t1` (
  `id` int NOT NULL,
  `m` int DEFAULT NULL,
  `n` int DEFAULT NULL,
  PRIMARY KEY (`id`)
);

CREATE TABLE `t2` (
  `id` int NOT NULL,
  `m` int DEFAULT NULL,
  `n` int DEFAULT NULL,
  PRIMARY KEY (`id`)
);

If we execute the following statement

SELECT * FROM t1 STRAIGHT_JOIN t2 ON t1.m = t2.m;

Here I used STRAIGHT_JOIN to force t1 to be the driving table and t2 to be the driven table.

In a join query, the driving table is accessed only once, while the driven table is accessed many times; the exact number of accesses depends on how many records in the driving table match the query conditions. Now that the driving table and the driven table have been fixed, let's look at the essence of joining two tables:

  1. t1 is the driving table, so the query is executed on t1 using the filter conditions that apply to it. Since there are no filter conditions here, this fetches all the rows of t1;
  2. For each record in the result set from the previous step, go to the driven table and find the matching records according to the join conditions.

Expressed in pseudocode, the whole process looks like this:

// t1Res is the result set of the driving table t1 after filtering
for (t1Row : t1Res) {
  // t2 is the complete driven table
  for (t2Row : t2) {
    if (satisfies the join condition && satisfies the filter conditions on t2) {
      send to client
    }
  }
}
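The same procedure as a runnable Python sketch. The rows are made-up dictionaries with the same row counts as t1 and t2, the join condition is t1.m = t2.m as in the statement above, and the function name nested_loop_join is only for illustration.

```python
def nested_loop_join(t1, t2):
    """Join t1 (driving) to t2 (driven) on column m; count driven-row reads."""
    result, driven_reads = [], 0
    for r1 in t1:                 # the driving table is read once
        for r2 in t2:             # the driven table is scanned for every driving row
            driven_reads += 1
            if r1["m"] == r2["m"]:
                result.append((r1, r2))
    return result, driven_reads


# Made-up data with the same row counts as in the article: 100 and 1000.
t1 = [{"id": i, "m": i, "n": i} for i in range(100)]
t2 = [{"id": i, "m": i, "n": i} for i in range(1000)]
matches, reads = nested_loop_join(t1, t2)
print(len(matches), reads)
```

With 100 driving rows and 1000 driven rows, the inner loop reads a driven-table row 100 × 1000 times.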

This is the simplest approach, but it also performs the worst. It is called the nested-loop join (Nested-Loop Join, NLJ). How can the join be sped up?

One way is to create an index, ideally on the driven table's (t2) columns involved in the join condition. After all, the driven table is queried many times, and each access to the driven table is essentially a single-table query (for a given t1 row, the condition used to query t2 is fixed).
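A rough Python sketch of why the index helps. A dict stands in for the B+ tree index on t2.m, which is a deliberate simplification, and build_index and index_nested_loop_join are hypothetical names.

```python
from collections import defaultdict


def build_index(table, column):
    """Map each value of `column` to the rows holding it (stand-in for a B+ tree)."""
    index = defaultdict(list)
    for row in table:
        index[row[column]].append(row)
    return index


def index_nested_loop_join(t1, t2_index):
    """For each driving row, do one index lookup instead of a full scan of t2."""
    result, lookups = [], 0
    for r1 in t1:
        lookups += 1
        for r2 in t2_index.get(r1["m"], []):
            result.append((r1, r2))
    return result, lookups


# Made-up data: 100 driving rows, 1000 driven rows.
t1 = [{"id": i, "m": i} for i in range(100)]
t2 = [{"id": i, "m": i} for i in range(1000)]
matches, lookups = index_nested_loop_join(t1, build_index(t2, "m"))
print(len(matches), lookups)
```

Each driving row now costs a single index lookup on the driven table rather than a pass over all of its rows.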

Since an index is now in use, to avoid repeating the mistake of not being able to use a covering index, we should also try not to SELECT * directly, but instead list the fields actually needed as the query columns and build an appropriate index for them.

But if we don't use an index, does MySQL really execute the join as a plain nested loop? Of course not. That kind of nested loop is far too slow!

Before MySQL 8.0, MySQL provided the Block Nested-Loop Join (BNL) method, and MySQL 8.0 introduced the hash join method (starting with 8.0.18). Both were proposed to solve the same problem: minimizing the number of accesses to the driven table.

Both methods use a fixed-size memory area called the join buffer, which holds a number of records from the driving table's result set (the two methods differ only in how they store them). This way, when a record of the driven table is loaded into memory, it is matched in one go against the multiple driving-table records in the join buffer. Because the matching is done in memory, this significantly reduces the I/O cost of repeatedly loading the driven table from disk.
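A minimal sketch of the block-based idea in Python. It assumes the join buffer capacity is expressed as a number of rows (the real buffer is sized in bytes), and block_nested_loop_join is a made-up name.

```python
def block_nested_loop_join(t1, t2, buffer_rows):
    """Join on m, loading `buffer_rows` driving rows into the join buffer at a time."""
    result, driven_scans = [], 0
    for start in range(0, len(t1), buffer_rows):
        join_buffer = t1[start:start + buffer_rows]   # one batch of driving rows
        driven_scans += 1                             # one full pass over t2 per batch
        for r2 in t2:
            for r1 in join_buffer:                    # matching happens in memory
                if r1["m"] == r2["m"]:
                    result.append((r1, r2))
    return result, driven_scans


# Made-up data: 100 driving rows, 1000 driven rows.
t1 = [{"id": i, "m": i} for i in range(100)]
t2 = [{"id": i, "m": i} for i in range(1000)]
for buffer_rows in (30, 100):
    matches, scans = block_nested_loop_join(t1, t2, buffer_rows)
    print(buffer_rows, len(matches), scans)
```

The driven table is scanned once per batch instead of once per driving row, so the larger the batch, the fewer scans are needed.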

Looking at the execution plan of the join query above, we find that hash join is used (provided no index has been created on the join column of t2; otherwise the index is used and the join buffer is not).
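For intuition, the classic build-and-probe shape of a hash join looks roughly like this in Python. It is a sketch that assumes all driving rows fit in memory at once; it is not MySQL's implementation, and hash_join is just an illustrative name.

```python
from collections import defaultdict


def hash_join(t1, t2):
    """Build a hash table on the driving rows, then probe it while scanning t2 once."""
    hash_table = defaultdict(list)
    for r1 in t1:                          # build phase
        hash_table[r1["m"]].append(r1)
    result = []
    for r2 in t2:                          # probe phase: a single pass over t2
        for r1 in hash_table.get(r2["m"], []):
            result.append((r1, r2))
    return result


# Made-up data: 100 driving rows, 1000 driven rows.
t1 = [{"id": i, "m": i} for i in range(100)]
t2 = [{"id": i, "m": i} for i in range(1000)]
matches = hash_join(t1, t2)
print(len(matches))
```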

The best case is when the join buffer is large enough to hold all the records in the driving table's result set, so that the driven table only needs to be accessed once to complete the join. The size can be configured with the join_buffer_size system variable, which defaults to 256KB. If the result set doesn't fit, the driving table's result set is put into the join buffer in batches: after the in-memory comparison for one batch finishes, the join buffer is emptied and the next batch is loaded, until the join is complete.

Here comes the key point! Not all columns of the driving table's records are placed in the join buffer; only the columns in the query list and the columns in the filter conditions are. So this is one more reminder not to use * as the query list. If we put only the columns we care about in the query list, more records fit in the join buffer, the number of batches goes down, and the number of accesses to the driven table naturally goes down too.
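A quick estimate of the effect in Python. The row count and per-row sizes are hypothetical, per-record overhead in the join buffer is ignored, and batches_needed is a made-up helper.

```python
import math

JOIN_BUFFER_SIZE = 256 * 1024  # default join_buffer_size (256KB)


def batches_needed(row_count, bytes_per_row, buffer_size=JOIN_BUFFER_SIZE):
    """How many times the join buffer must be filled to hold all driving rows."""
    rows_per_batch = max(1, buffer_size // bytes_per_row)
    return math.ceil(row_count / rows_per_batch)


# Hypothetical driving table with 1,000,000 rows: 12 bytes per row for the
# columns actually needed, 1,036 bytes per row if SELECT * drags in a wide column.
narrow_batches = batches_needed(1_000_000, 12)
wide_batches = batches_needed(1_000_000, 1036)
print(narrow_batches, wide_batches)
```

Since every batch costs one more pass over the driven table, under these assumptions the narrow query list saves thousands of passes.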

Copyright notice
Author: [Java architecture design]. Please include the original link when reprinting, thank you.
