
HW8: MySQL Query Processing CS245 Winter 2017

Due: Mar 7, 2017, 23:59pm

MySQL Query Processing [1]

Homework 8, CS245 Winter 2017

In this assignment, we will see how MySQL parses, normalizes, and rewrites queries by looking at

its code with the help of a debugger.

Note that you will submit this homework electronically via Gradiance:

http://www.newgradiance.com/services

Taking notes while following the instructions and questions in this document will help you solve

the problems in Gradiance. However, you do NOT have to submit any cleanly written answers to

the questions in this document.

Please start early! At least download the necessary files first, even if you plan to work on the assignment later. The virtual machine disk image is large and may take some time to download. Also note that because you need to understand and operate real software, this assignment can take longer than a typical Gradiance homework.

A. Setup

If you have already followed the setup instructions in the “MySQL/InnoDB B+Tree” assignment,

only the last subsection about GDB and DDD is new, and the first few can be safely skipped.

A.1. Downloading

● VirtualBox disk image: http://www.stanford.edu/class/cs245/data/cs245.vdi.zip (1.3GiB)

○ A smaller 7zip file (856MiB) is also available: replace .zip with .7z in the URL above to download it if you can handle 7zip files or are having trouble downloading the larger one.

A.2. VirtualBox

After downloading the files, set up a virtual machine using the VirtualBox disk image (cs245.vdi).

1. Install VirtualBox using the provided installer, or download the installer from https://www.virtualbox.org/wiki/Downloads.

2. Create a new virtual machine:

a. Start VirtualBox, click the “New” button, and then enter the following values:

■ Name: cs245 (or whatever name you like)

■ Type: Linux

■ Version: Red Hat (64 bit)

b. Set other settings like memory size, etc. to your liking.

c. For the hard drive, make sure to point to the provided cs245.vdi.

[1] This assignment was initially created by Dennis Sidharta and revised by Jaeho Shin.

Click “Start” to launch the virtual machine. To login, use the following username and password:

● username: root

● password: cs_245

VirtualBox’s Guest Additions have been installed in the virtual machine. Guest Additions add a few features such as mouse-pointer integration with the host OS, shared folder support, etc.

A.3. MySQL

We will be using MySQL Community Server version 5.6.10, which has been installed in the virtual machine.

Here are various important folders and files that you may need to know:

● Source location: /root/workspace/mysql-5.6.10

● Installation location: /usr/local/mysql

● Configuration file: /etc/my.cnf

● Other files:

○ /var/lib/mysql/mysql.sock

○ /var/log/mysqld.log

○ /var/run/mysqld/mysqld.pid

A.4. GDB and DDD

DDD has also been installed for you. DDD (http://www.gnu.org/software/ddd) provides a graphical front-end to GDB (http://www.gnu.org/software/gdb/gdb.html). To launch it, type ddd in a terminal.

Inspecting a program via DDD requires two steps (more on this later):

1. Loading the program.

2. Attaching DDD to the program’s running process.


B. Starting MySQL and Attaching DDD to It

B.1. Launching MySQL Server

The commands to start and to stop MySQL server are, respectively:

● start: mysqld --debug

● stop: mysqladmin shutdown

Open a terminal, and then start MySQL server. Once the server is started, find out its process id by

executing the following command (the prompt and the output are shown):

# ps aux | grep mysqld

9323 pts/0 Sl+ 0:00 ./mysqld --debug

In the example above, the process id is 9323. Make a note of the id printed in your terminal; we will use it soon. Do not shut down the server at the moment.

B.2. MySQL Client

On a separate terminal, start a MySQL client by executing mysql -uroot. Once it is started, set the default DB to employee_db by executing use employee_db;

Do not terminate the client for the moment, but if you need to, simply execute exit.

B.3. DDD

B.3.1. Attaching to MySQL

On a separate terminal, start DDD by typing ddd. Once started, attach it to the currently running

MySQL server process:

1. Select “File” from the menu, “Open Program...”, and then enter the following in the

“Program” field: /usr/local/mysql/bin/mysqld

Alternatively, you can start DDD with the path to the program as an argument to the

command:

ddd /usr/local/mysql/bin/mysqld

2. In the GDB console (at the bottom window), type the following:

attach process_id

where process_id is the one you noted in section B.1, e.g., 9323. If the GDB console is not shown at the bottom of DDD’s screen, activate it by selecting “View...” from the menu and then “GDB Console.”

Because the MySQL server process runs as a different user (mysql) than yourself (root), you cannot use DDD’s “Attach to Process...” feature in its “File” menu to simply select mysqld from the list; you must instead find the process id by running a separate command.

B.3.2. Looking up Source Code Elements

You can easily bring up the part of MySQL’s source code you want to see using DDD’s Lookup feature. Enter a query in the text input box at the top of DDD’s window, then press the Return or Enter key, or click the “Lookup” button right next to it. The query can be many things, including the following:

● File name with an optional line number, e.g., sql_parse.cc:1134

● Function name, e.g., dispatch_command

● Global variable name, e.g., thread_scheduler

● Class name, e.g., THD

● Type name (structs, enums, typedefs, etc.), e.g., scheduler_functions and enum_server_command

Unfortunately, DDD cannot look up macro or enum constant names, e.g., MYSQL_CALLBACK or COM_QUERY, as such information is not available as debug symbols in the compiled binary. Although looking up such information is not critical for this assignment, dedicated analysis tools will help you navigate the source code quickly. Consider using cscope or ctags with your favorite editor if you want to dive deeper into the source code. A quick web search will teach you how to use them.

B.3.3. Setting Breakpoints

You can now set breakpoints via the GDB console. For example, to put a breakpoint in sql_parse.cc at line 1134:

break /root/workspace/mysql-5.6.10/sql/sql_parse.cc:1134

Alternatively, in the source code you brought up in DDD’s source window, right-clicking on a line will let you set or delete a breakpoint on it. Displaying the line numbers in the source window by selecting “Source” from the menu and enabling “Display Line Numbers” will help.

You will have noticed that attaching to a running process pauses its execution. To let it continue running and eventually reach the breakpoints, we must resume the process afterwards. Remember to always resume the MySQL server process after setting up your breakpoints by either typing cont (or simply c) into the GDB console, or using the “Cont” button in the middle of the floating window.


1. Debugging Basics

The goal of this problem is to familiarize yourself with the environment to debug MySQL server.

We will inspect how MySQL server creates threads and handles connections. Unless noted

otherwise, the source files that we will look at in this problem are all located in

/root/workspace/mysql-5.6.10/sql. Also, we will use the following convention to refer to a function:

file_name:function_name().

MySQL server creates a thread for each new connection.

1. In mysqld.cc:handle_connections_sockets(), you will find an infinite loop,

listening for new connections. Take a few minutes to look inside this function. You do not

have to understand everything that is going on. What is the name of the static function

in mysqld.cc that creates the new thread?

2. Eventually, the thread’s control is handed over to

sql_connect.cc:do_handle_one_connection(), sql_parse.cc:do_command(), and

then to sql_parse.cc:dispatch_command(), etc. Take a peek at those functions. Notice

that thd is heavily used, and is passed around between many functions. It is of the type

THD, which is defined in sql_class.h. What are the direct parent classes of THD?

3. thd holds, among other things, all of the information related to the current thread. We will inspect the contents of this object, so set a breakpoint at the first statement in sql_parse.cc:do_command(). Report the command that you used. Because you may see other commands sent by MySQL client while it starts up, it is recommended to set the breakpoint after you have a MySQL client ready to accept your input.

4. To get to the breakpoint, first continue the execution of MySQL server, and run the

following SQL query in the MySQL client:

select * from employee limit 1;

Once the execution stops at the breakpoint, double-click thd to show that variable in

DDD’s graphical data window. Notice that double clicking a variable generates a

command in the GDB console. What was the generated command?

5. Move the execution forward by typing n in the GDB console (or clicking the “Next” button) until you reach the statement that accesses net->read_pos. Here, thd->net.read_pos stores the raw query read from the network. Show its value in the GDB console by executing the following in the console: p thd->net.read_pos. What is thd->net.read_pos’s value? Another way to discover a variable’s value is by double-clicking the variable in the graphical data window. Explore thd by double-clicking it further.

6. Notice the extra byte prefixing the raw query stored in thd->net.read_pos. That first byte tells us the query’s type. In fact, it is an enum enum_server_command, defined in /root/workspace/mysql-5.6.10/include/mysql_com.h. Execute the following command:

p (enum enum_server_command) ((uchar) thd->net.read_pos[0])

What is the type of the query we executed in question 4?
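The idea in question 6, a one-byte command code followed by the command's payload, can be illustrated outside of MySQL. The following is only a minimal sketch under stated assumptions: ServerCommand, DecodedPacket, and decode_packet are hypothetical names invented for this illustration, and the enum lists just the first four values rather than the full enum_server_command from mysql_com.h. It is not how the server itself reads packets.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical, shortened stand-in for enum_server_command. Only the first
// four values are listed; the real enum in include/mysql_com.h has many more.
enum ServerCommand : unsigned char {
    COM_SLEEP = 0,
    COM_QUIT = 1,
    COM_INIT_DB = 2,
    COM_QUERY = 3
};

struct DecodedPacket {
    ServerCommand command;  // taken from the first byte of the buffer
    std::string payload;    // everything after the first byte
};

// Splits a raw buffer into its leading command byte and the remaining payload.
DecodedPacket decode_packet(const unsigned char* buf, std::size_t len) {
    assert(buf != nullptr && len >= 1);  // a packet needs at least the command byte
    DecodedPacket out;
    out.command = static_cast<ServerCommand>(buf[0]);
    out.payload.assign(reinterpret_cast<const char*>(buf + 1), len - 1);
    return out;
}
```

For instance, passing a three-byte buffer whose first byte is 2 and whose remaining bytes spell a database name yields COM_INIT_DB as the command and the name as the payload. The cast in the GDB command of question 6 performs the same first-byte interpretation on thd->net.read_pos.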


2. Query Parsing

In this problem, we will look at MySQL’s context-free-grammar rules for its SQL statements, and

then we will generate a parse tree of a valid statement.

We will use the employee table from the employee_db, whose schema looks like:

MySQL uses GNU Bison (http://www.gnu.org/software/bison/) to generate parsers of SQL

statements. The context free grammar rules for the statements are defined in

/root/workspace/mysql-5.6.10/sql/sql_yacc.yy. Briefly skim this file.

You will notice various defines, function declarations, terminal symbols (those declared with %token), and grammar rules. Look at the rules for parsing a select statement. The first two such rules are reproduced below:

create_select:
          SELECT_SYM
          {
            // ...
          }
          select_options select_item_list
          {
            // ...
          }
          opt_select_from
          {
            // ...
          }
        ;

select_options:
          /* empty */
        | select_option_list
          {
            // ...
          }
        ;


Here are a few notes about the above rules:

1. SELECT_SYM is a terminal symbol, and it is defined earlier in the file as %token SELECT_SYM. It represents a reserved MySQL keyword. The value of the reserved keyword, which in this case is “SELECT,” can be found in lex.h.

2. In the two rules reproduced above (create_select and select_options), we did not show the semantic actions, which are C statements enclosed in curly braces. An action defined for a particular rule is executed whenever that rule matches. For this problem, you do not need to understand actions.

3. We can therefore see that create_select consists of SELECT_SYM, select_options, select_item_list, and opt_select_from, in that order.

4. “|” signifies alternatives, so select_options is either an empty string or a select_option_list.
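To make notes 3 and 4 concrete, here is a hand-written recursive-descent recognizer that mirrors the shape of the two rules. This is a deliberately different technique from the LALR(1) parser that Bison generates, and it is only a sketch under assumptions: the token representation (a vector of uppercase strings), the two option keywords, and the reduced forms of select_item_list and opt_select_from are simplifications invented here, not the rules in sql_yacc.yy.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

using Tokens = std::vector<std::string>;

// select_options: /* empty */ | select_option_list
// The empty alternative means this step can never fail.
bool parse_select_options(const Tokens& t, std::size_t& pos) {
    // Assumption: only two option keywords are recognized in this sketch.
    while (pos < t.size() && (t[pos] == "DISTINCT" || t[pos] == "ALL")) ++pos;
    return true;
}

// Simplified select_item_list: one item, then zero or more ", item".
bool parse_select_item_list(const Tokens& t, std::size_t& pos) {
    auto is_item = [&](std::size_t i) {
        return i < t.size() && t[i] != "," && t[i] != "FROM";
    };
    if (!is_item(pos)) return false;
    ++pos;
    while (pos < t.size() && t[pos] == ",") {
        ++pos;  // consume the comma
        if (!is_item(pos)) return false;
        ++pos;  // consume the item
    }
    return true;
}

// Simplified opt_select_from: /* empty */ | FROM table_name
bool parse_opt_select_from(const Tokens& t, std::size_t& pos) {
    if (pos < t.size() && t[pos] == "FROM") {
        ++pos;
        if (pos >= t.size()) return false;  // FROM must be followed by a name
        ++pos;
    }
    return true;
}

// create_select: SELECT_SYM select_options select_item_list opt_select_from
bool parse_select(const Tokens& t) {
    std::size_t pos = 0;
    if (t.empty() || t[pos] != "SELECT") return false;
    ++pos;
    assert(pos <= t.size());
    return parse_select_options(t, pos) && parse_select_item_list(t, pos) &&
           parse_opt_select_from(t, pos) && pos == t.size();
}
```

The sequence in parse_select corresponds to the right-hand side of create_select, and each "empty" alternative shows up as a function that may succeed without consuming any token.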

For example, the figure below shows the parse tree of the following SQL statement:

select NULL;

Explore the related grammar rules, and then draw a parse tree of the following SQL statement:

select * from employee where dept = 'engr' limit 10;

You may immediately replace IDENT, TEXT_STRING, and NUM symbols with any identifiers, text

strings, and numeric values, respectively.


3. Query Normalization

The goal of this problem is to see MySQL’s internal representation of a query. As in problem 2, we

will use the employee table from the employee_db.

As you explore MySQL’s source code, you will encounter the Item class very often. Item is defined in item.h along with many other classes derived from it. It is used to manipulate various items, such as fields, functions, etc.

In the previous problem, we generated the parse tree of the following statement:

select * from employee where dept = 'engr' limit 10;

The enum_server_command of the statement above is COM_QUERY. Take a moment to skim

sql_parse.cc:dispatch_command() and then sql_parse.cc:mysql_parse(). Notice that the

former calls the latter.

Start MySQL server, MySQL client, and DDD. And then, set a breakpoint in

sql_parse.cc:mysql_parse() at the line immediately after the following statement:

err = parse_sql(thd, parser_state, NULL)

which should be at around line 6055. Once set, execute the query above.

The parsed result is stored in thd->lex->select_lex. And so, when the program stops at the

breakpoint, inspect the contents of select_lex. You may navigate to select_lex via DDD’s

visualization by double-clicking thd, and then lex (hint: lex is defined in the Statement

class, which is THD’s parent class), or you may execute the following command in DDD’s console:

graph display `p thd->lex->select_lex`

Notice the backquotes in the command.

By looking at the select_lex’s fields, can you tell which of them store the table name, the

column names (in this case “*”), the where clause, and the limit clause for the query you just

executed?

1. Questions on table_list:

a. It is often helpful to know what sort of object you are dealing with by printing its pointer type in the GDB console. Execute:

p &(thd->lex->select_lex.table_list)

What is table_list’s type? Note that “&” is the address-of operator, i.e., you just printed the address of table_list.

b. The definition of table_list’s type, which is a linked list, can be found in sql_list.h. Briefly skim the class. Notice that it has a member named first. Let us display the first element of the linked list by executing:

graph display `p thd->lex->select_lex.table_list.first`

Navigate into the displayed variable by double-clicking it. In which fields are the DB and table names stored?

2. Questions on item_list:

a. Similarly, what is item_list’s type? Report the command you used.

b. Just like table_list, item_list is also a linked list. Its definition can also be found in sql_list.h. Briefly skim the class. Notice that it has a head() method. Let us display the head element by executing:

graph display `call thd->lex->select_lex.item_list.head()`

Into which field is the column name (in this case “*”) stored?

3. Questions on select_limit:

a. When you execute p thd->lex->select_lex.select_limit, you will

find that select_limit’s type is Item*. However, select_limit is an

instance of Item_int, which is derived from Item. Here, Item::type()

provides a hint. Execute:

call thd->lex->select_lex.select_limit->type()

What is the select_limit’s enum Type?

b. Execute:

graph display `p (Item_int*) thd->lex->select_lex.select_limit`

What is the value of value?


4. Questions on where:

a. Similar to 3.a, what is where’s enum Type? Report the command you used.

b. Similar to 3.b, graph display where. What is the value of arg_count? Report

the command you used.

c. Navigate into the fields of where to see how the where clause of the query you

executed is stored internally. In which fields are the arguments (in this case

“dept” and “engr”) stored?

d. Can you tell how args, which points to a pointer, is related to tmp_arg, an array of two pointers? What is the role of the next pointer in the Items pointed to by args and tmp_arg?

e. You could tell where is an instance of Item_func by calling its type().

However, it is actually an instance of a subclass that is derived from Item_func,

and Item_func::functype() can provide a hint. Execute:

call ((Item_func*) thd->lex->select_lex.where)->functype()

What is where’s function type?

f. Take a look inside item_cmpfunc.h. Based on your answer to question 4.e, of which class is where an instance? Hint: For instance, Item_func_xor is the class for Item_func::XOR_FUNC.

g. And, at precisely which line of the source code do you think where was

instantiated as that class? Hint: where_clause in sql_yacc.yy.
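The pattern used throughout this problem, asking a base-class pointer for its type tag and then casting to the matching subclass, can be sketched in isolation. The classes below (Node, NumberNode) and the function number_value_of are hypothetical names invented for this illustration and are intentionally unrelated to MySQL's real Item hierarchy; the sketch only shows why a cast such as (Item_int*) is reasonable once type() has been checked.

```cpp
#include <cassert>

// Hypothetical base class with a virtual type tag, loosely analogous to
// a class hierarchy that exposes a type() method.
struct Node {
    enum Kind { NAME, NUMBER, FUNCTION };
    virtual ~Node() = default;
    virtual Kind kind() const = 0;
};

// Hypothetical subclass carrying an integer payload.
struct NumberNode : Node {
    long long value;
    explicit NumberNode(long long v) : value(v) {}
    Kind kind() const override { return NUMBER; }
};

// Reads the payload through a base pointer. The tag is checked first, so the
// static_cast to the subclass is known to be valid.
long long number_value_of(const Node* node) {
    assert(node != nullptr && node->kind() == Node::NUMBER);
    return static_cast<const NumberNode*>(node)->value;
}
```

In the debugger the same two steps are done by hand: call the tag method first, then print the pointer cast to the subclass that the tag indicates.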


4. Query Rewriting

In this problem, we will look at how MySQL rewrites queries, in particular, how it removes

conditions that always evaluate to true or false.

For this problem, we will execute the following statements:

set @var := (select id from employee order by id asc limit 1);

select * from employee where @var = 1 limit 10;

We want to see how MySQL replaces the condition, @var = 1, with Item::COND_TRUE or Item::COND_FALSE. Take a moment to skim sql_optimizer.cc:JOIN::optimize(). Pay attention to the statements that call optimize_cond() [2], particularly the following statement:

conds = optimize_cond(thd, conds, &cond_equal,
                      join_list, true, &select_lex->cond_value);

You can find the definition of JOIN::conds in sql_optimizer.h. It refers to the same Item object referred to by thd->lex->select_lex.where [3].

The value that results from evaluating where is stored in select_lex->cond_value, which

can be one of the following: Item::COND_UNDEF, Item::COND_TRUE, Item::COND_FALSE,

or Item::COND_OK. The semantics of those values can be found in the comments immediately

preceding conds’ declaration, in sql_optimizer.h.

In order to remove the redundant condition in our query, the control will move from

optimize_cond() to remove_eq_conds(), and then to internal_remove_eq_conds().

And so, briefly skim those functions as well.

[2] optimize() will also eventually call simplify_joins(). Although we will not discuss how joins are rewritten in this assignment, it is good to know how MySQL does it, so make sure to explore that function and read its documentation.

[3] If you are curious, you can find the assignment of select_lex->where to conds in sql_resolver.cc:JOIN::prepare(). It is a good exercise to trace the function calls to get there from main(), and then to sql_optimizer.cc:JOIN::optimize(). Hint: You may start at sql_select.cc:handle_select().

Start MySQL server, MySQL client, and DDD. Then execute the following statement in the MySQL client:

set @var := (select id from employee order by id asc limit 1);

Afterwards, put a breakpoint at the first statement in internal_remove_eq_conds(). Notice

that, from the function signature, conds is now referred to as cond.

Once the breakpoint is set, execute the following query:

select * from employee where @var = 1 limit 10;

1. When the program stops at the breakpoint,

a. Execute call ((Item_func*) cond)->functype() in the GDB console.

What is cond’s function type?

b. Execute call cond->const_item() in the GDB console. Is cond a const

item?

c. And then, execute call cond->is_expensive(). Does cond represent an

expression that is expensive to compute?

2. Evaluating cond involves comparing two values. Execute the following command:

graph display `p (Item_func_eq*) cond`

Then, inspect cond’s cmp. The two values being compared are stored in cmp::a and cmp::b.

a. Like cond, cmp::a is also of the type Item::FUNC_ITEM. Execute:

call ((Item_func*)(*(((Item_func_eq*)cond)->cmp.a)))->functype()

What is cmp::a’s function type?

b. Take a look inside item_func.h, and then based on your answer to question 2.a., of

which class is cmp::a an instance?

c. Execute the following command:

call (*((Item_func_eq*) cond)->cmp.a)->val_int()

What is cmp::a’s integer value?


d. Unlike cond, cmp::b is not a function. What is its type, i.e., the value returned

by type()? Also, report the command you used.

e. What is cmp::b’s integer value? Report the command you used.

3. Move the program’s execution forward until it reaches the following statement:

*cond_value = eval_const_cond(cond) ? Item::COND_TRUE : Item::COND_FALSE;

Then, execute p eval_const_cond(cond). What boolean value does cond evaluate

to?

Thus, cond_value stores either Item::COND_TRUE or Item::COND_FALSE. And,

JOIN::conds is, therefore, set to (Item*) 0, i.e., null.
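The outcome described above, where a constant condition is evaluated once, its result is recorded, and the condition pointer becomes null, can be summarized in a small stand-alone sketch. Everything here is an assumption made for illustration: Cond, CondResult, and remove_const_cond are hypothetical names, the is_const flag stands in for checks such as const_item() and is_expensive(), and the reading of COND_OK as "keep the condition and evaluate it per row" is this sketch's interpretation rather than a quote from sql_optimizer.h.

```cpp
#include <cassert>

// Hypothetical result codes mirroring the four names mentioned in the text.
enum class CondResult { COND_UNDEF, COND_TRUE, COND_FALSE, COND_OK };

// Hypothetical condition node. A real condition would be an expression tree;
// here two flags are enough to show the control flow.
struct Cond {
    bool is_const;     // true when the result does not depend on any row
    bool const_value;  // the result, meaningful only when is_const is true
};

// Returns the condition that must still be evaluated per row, or nullptr
// when the condition was folded into cond_value.
const Cond* remove_const_cond(const Cond* cond, CondResult& cond_value) {
    assert(cond != nullptr);
    if (cond->is_const) {
        // Evaluate once, remember the outcome, and drop the condition.
        cond_value = cond->const_value ? CondResult::COND_TRUE
                                       : CondResult::COND_FALSE;
        return nullptr;
    }
    cond_value = CondResult::COND_OK;
    return cond;
}
```

With a constant condition the function returns a null pointer and reports COND_TRUE or COND_FALSE, which matches what you should observe for JOIN::conds and cond_value after stepping past the eval_const_cond() statement.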