Writing queries that are optimized

When writing a query there are basic parts:
1) the result set
2) the tables to get the data from
3) the where clause – provides filtering
4) group by for grouping data to be aggregated.

This article will discuss areas 2 and 3 since that’s where most query optimizations occur.

Joining Tables efficiently:

Mysql has a practical limit to the number of tables you should join. It is 10 (some opinions differ).
While Mysql can join more than that but statement performance may suffer due to the calculation of the statement execution plan. The more tables Mysql has to deal with, the greater number of plans it has to explore when deciding
on how to execute the statement.

When you join 2 tables you want to make sure the columns you join on are indexed in at least of of the tables.
If the columns are indexed, Mysql can find the row in the joined table quickly via an index lookup.

Ex:
Select xxx,yyy,xxx
from table_a
join table_b on table_b.indexed_column = table_a.indexed_column

In this example the column ‘indexed_column’ represents a column that has an index on it in
table_a and/or table_b.

So for each row in table_a, the select will quickly find the matching row in table_b due to the join condition which is on the ‘indexed_column’.

Now if we join the tables on columns that are not indexed in either table_a or table_b.

Select xxx,yyy,xxx
from table_a
join table_b on table_b.unindexed_column = table_a.unindexed_column

Mysql will have to perform a table scan of table_b for each row in table_a.
So if table_a has 100 rows in it and table_b has 1m rows that means that Mysql will have to scan table_b 100 times scanning a total of 100 million rows.
This is very inefficient.

Subselects as tables:

Mysql materializes subselects. What that means is the first time the subselect is run, Mysql creates a temporary table to store the result set so Mysql does not have to re-execute subselect. Note: if your result set is very large, say hundreds of thousands of rows, that can create a very large temp table.

Many times I have see queries that use a subselect as a table. That is fine.
However you must remember that if you join a subselect to another table, the column used for the join condition must be indexed in the joined table.
If the joined table does not have an index on the column used in the join condition then the joined table will be scanned for each row in the subselect result set.

That also means that you should not join one subselect to another subselect since the result sets of subselects are not indexed.

Another thing to keep in mind is that the columns used in join conditions must be the same datatype. If they are not the same datatype and there is an index on the columns, Mysql can not make use of the indexes and will end up performing table scans.

Something else you need to know is that Mysql is pretty smart. When it creates it’s join plan, it can detect which table should be the main table and which should be the joined table.
For example, if the join condition column is indexed in table_a but not in table_b, it will join table_a to table_b using table_b as the primary table. It also means that Mysql may not join the tables in the same order you think it should. I have seen situations where Mysql did not ‘guess’ correctly.
To override Mysql’s guess, you can specify ‘STRAIGHT_JOIN’ after the SELECT statement. This will cause your tables to be joined in the order you specify.
The primary table is where Mysql starts getting rows from first.

The ‘WHERE’ clause:

This is the most important part of an sql statement as it filters out rows.
If you don’t use or it is not used correctly, your select will be performing full tables scans.
This is very very bad.

Using a WHERE clause is easy. You simply specify a column = another_column AND/OR a column = a CONSTANT.

Anytime you filter on a column the column should be indexed.
Also never change the datatype of the column you are filtering on (as in using CAST). This prevents Mysql from using any associated index.
Mysql can only use an index when the column is used in it’s ‘native’ datatype.

The datatypes of the column you are filtering on and the constant/column you are comparing to must match. If you need to change the datatype, change the datatype of the constant/column and not that of the indexed column.

Using an ‘IN’ clause:
Limit the size of the items inside an ‘IN’ clause. If you have 50 items that’s fine. Don’t have 25 million.
There are better ways to filter.

Subselect in a WHERE clause:

Using a subselect in a WHERE clause is okay if the subselect returns a small result set.
If your subselect returns a very large result set, you may be better off simply joining the tables in the subselect like normal tables.

Get rid of subselects:

Due to the inability of subselects to be indexed you can get rid of them.
What you do is to create a temporary table. If you specify ‘ENGINE=innodb’, the table will be created as an innodb table and will exist in the innodb buffer pool. This avoids eating up the servers free RAM (reduces the chances of out of memory situation). When you create this temporary table, you can create indexes.
Then, in your query you can join this temporary table just like a normal table on the indexed columns.
Maintaining the indexes (upon insert into the table) will cause a minor performance hit but the performance gain from having the indexes will more than offset that.

The ‘Takeaway’:

The takeaway from this document is that INDEXES are one of the best performance boosters. Create them and use them.

You can not use any function (such as UPPER(), LOWER(), user_defined_function(), …..) on the indexed column or Mysql will not be able to use it.