The Sql Server Batch Processor
Little is publicly known about the batch parser in Sql Server, in contrast, cool things like page structures and GAM’s have been pretty widely documented from the inside sql series of books by Kalen Delaney to the blogs of the Sql Server insider Paul Randal. The most that the batch parser normally gets is a line to say that it parses the query to confirm it is valid, this article from 1999 (http://sqlmag.com/t-sql/inside-sql-server-parse-compile-and-optimize) is typical in its thriftiness for the batch processor:“I won’t go into detail about the query-parsing process. In short, the parser checks for correct syntax, including the correct spelling and use of keywords”
In this post I am going to explore how the parser actually works and what components are involved in converting the query that is sent to Sql Server into something that can be acted upon.
Overview of the process
Essentially a string is passed to sql server, the string could say anything such as “exec store_procedure” or “create stored procedure hello_there” the command could be valid or invalid “select top banana from shop”, it might be very short or very long, it might contain one statement or lots. Sometimes everything seems good but then a table doesn’t actually exits so it isn’t very good. The command could be a DDL command, it could be a DML command or it could all be mixed together. There are some commands which are valid at one point but invalid at another, Hekaton has some additional values such as “NATIVE_COMPILATION” which is valid in the case of a Hekaton stored procedure but not valid normally.
The parsing process is about taking these input strings, whatever they might contain and converting them into something that Sql can actually use if it is valid or rejecting the statement if it is invalid. The general idea for things like this is to parse as quick as possible, get everything you need and if it is invalid throw it away as quick as possible – Sql has a hard enough time to do everything it is meant to do without wasting cycles on queries it cannot service.
The whole point of of the parsing process is to take a string and turn it into a normalized tree. A normalized tree is a set of operations that tell Sql Server exactly what to do, rather than:
“SELECT blah from TableBlah Where Column = 123”
Sql needs exact instructions such as:
constant @value of type int is 123
Read From Table “TableBlah”
Filter to include rows where Column = @value
Return column Blah
If we look at the specific instructions, there are two things that have happened here, firstly the text has been converted into steps and secondly, the parts that can change such as the value have been removed, i.e. the query has been normalized.
What are the inputs to the parsing?
- String to be parsed
- This is passed in from the client as should be fairly obvious
- Parsing options
- Are quoted identifiers enabled or not, whether these are on or not really affect how the parser works
- Compatibility Mode
- Different keywords and commands are allowed depening on the compatibility mode
- Server properties
- Some commands, such as Hekaton, are only valid on certain instance types
The catalogue of object names and security permissions, although used in the process are not part of the query parsing and normalization, they are used when the actual actual work is done.
What is the output from the parsing?
The output is a tree of operations such as a create table statement or a select statement, each of these statements are valid, however in some cases, such as a create database statement without a database name, are invalid and the parser wouldn’t be able to add it to the tree.
The size and complexity of the tree is entirely down to the size and complexity of the query that is parsed. Benjamin Nevarez has some interesting trace flags in his post
(http://www.benjaminnevarez.com/2012/04/more-undocumented-query-optimizer-trace-flags/) which show the logical and physical trees used as part of the compilation process.
A look in detail at the parsing
The parser uses CParser from sqllang.dll which seems to have two roles, firstly it retrieves the actual characters from the string and then stores the tree, it adds operations to the tree by adding new step into the tree.
The batch parser uses a model based on yacc and lex, the internal functions are yyparse and yylex, yyparse is particularly long and probably generated using yacc or what was once yacc but is now, perhaps, heavily customized.
What are yacc and lex?
Aside from sounding like a weird cow like animal that is particularly hard to shear and a character from superman, yacc and lex are basically the most widely used tools by compilers to understand the text that they are given. It is often easy to forget that Sql Server is, amongst other things, a compiler, well it is! Sql has to do something with the text that it receives and yacc and lex are the answer.
Lex breaks the text into tokens, it has a set of tokens it is trained to find, so for example it will know that “select”, “insert”, “with”, “for” and “1980” are all tokens and they have a specific type.
Yacc understands when and where certain tokens can be used, for example “NATIVE_COMPILATION” is a token that lex can parse and returns to yacc that it it a valid token but it is up to yacc to decide whether it is valid after a “SELECT” or only valid after a stored procedure definition on an instance of Sql Server where Hekaton is enabled (i.e. 2014 enterprise edition).
Typically with yacc, what happens is you have a grammar definition that specifies the tokens and where they are valid and then you use yacc to compile that garmmer into actual source code which is essentially a giant state machine which parses queries into a tree. To give some idea about size for this, I believe that the main switch statement in Sql 2014 has 3534 entries in it, this is far to large to be managed sensibly by developers so I really hope that the query parsing team don’t have to manage it manually!
The process for decoding a string is that yyparse which is the state machine built by taking the T-Sql grammar file and compiling it by yacc keeps calling yylex which returns the type of token that it has found until it reaches the end of the string at which point it returns -1.
While yyparse retrieves valid tokens it takes action according to the type of token so for example, the token returned by yylex is a SELECT token then yyparse adds a step into the tree held by the CParser to do a select and adds in all the values needed such as whether there is a table and what the table name is.
What makes the whole process so quick is that firstly that it is basically one big while loop wrapping a case statement which uses only a minimum amount of instructions to parse the text and it makes only a single parse through the query, it isn’t pretty and probably isn’t very easy to write unit tests for but it is fast and does the job. It certainly has to as it is used every time any text is sent to Sql Server to execute, whether valid or invalid.
How was this info gathered
Although I would love to have access to the Sql Server source code, I don’t, this was gathered using a debugger (windbg / cdb), the public symbols and a real thirst for understanding how Sql Server parsed the strange queries that is sent so accurately, quickly and consistently. The books on Sql Server internals seem to leave off the parser which is a shame as it is so core to everything that Sql does, I guess because it is so reliable there is no need to bother documenting what it does or how it works.