I have decided to move my blog over to http://agilesql.club/blogs/Ed-Elliott – this will be the last post here so come over and visit over there, it is much much more exciting (I promise)!
The Sql Server Batch Processor
Little is publicly known about the batch parser in Sql Server, in contrast, cool things like page structures and GAM’s have been pretty widely documented from the inside sql series of books by Kalen Delaney to the blogs of the Sql Server insider Paul Randal. The most that the batch parser normally gets is a line to say that it parses the query to confirm it is valid, this article from 1999 (http://sqlmag.com/t-sql/inside-sql-server-parse-compile-and-optimize) is typical in its thriftiness for the batch processor:“I won’t go into detail about the query-parsing process. In short, the parser checks for correct syntax, including the correct spelling and use of keywords”
In this post I am going to explore how the parser actually works and what components are involved in converting the query that is sent to Sql Server into something that can be acted upon.
Overview of the process
Essentially a string is passed to sql server, the string could say anything such as “exec store_procedure” or “create stored procedure hello_there” the command could be valid or invalid “select top banana from shop”, it might be very short or very long, it might contain one statement or lots. Sometimes everything seems good but then a table doesn’t actually exits so it isn’t very good. The command could be a DDL command, it could be a DML command or it could all be mixed together. There are some commands which are valid at one point but invalid at another, Hekaton has some additional values such as “NATIVE_COMPILATION” which is valid in the case of a Hekaton stored procedure but not valid normally.
The parsing process is about taking these input strings, whatever they might contain and converting them into something that Sql can actually use if it is valid or rejecting the statement if it is invalid. The general idea for things like this is to parse as quick as possible, get everything you need and if it is invalid throw it away as quick as possible – Sql has a hard enough time to do everything it is meant to do without wasting cycles on queries it cannot service.
The whole point of of the parsing process is to take a string and turn it into a normalized tree. A normalized tree is a set of operations that tell Sql Server exactly what to do, rather than:
“SELECT blah from TableBlah Where Column = 123”
Sql needs exact instructions such as:
constant @value of type int is 123
Read From Table “TableBlah”
Filter to include rows where Column = @value
Return column Blah
If we look at the specific instructions, there are two things that have happened here, firstly the text has been converted into steps and secondly, the parts that can change such as the value have been removed, i.e. the query has been normalized.
What are the inputs to the parsing?
- String to be parsed
- This is passed in from the client as should be fairly obvious
- Parsing options
- Are quoted identifiers enabled or not, whether these are on or not really affect how the parser works
- Compatibility Mode
- Different keywords and commands are allowed depening on the compatibility mode
- Server properties
- Some commands, such as Hekaton, are only valid on certain instance types
The catalogue of object names and security permissions, although used in the process are not part of the query parsing and normalization, they are used when the actual actual work is done.
What is the output from the parsing?
The output is a tree of operations such as a create table statement or a select statement, each of these statements are valid, however in some cases, such as a create database statement without a database name, are invalid and the parser wouldn’t be able to add it to the tree.
The size and complexity of the tree is entirely down to the size and complexity of the query that is parsed. Benjamin Nevarez has some interesting trace flags in his post
(http://www.benjaminnevarez.com/2012/04/more-undocumented-query-optimizer-trace-flags/) which show the logical and physical trees used as part of the compilation process.
A look in detail at the parsing
The parser uses CParser from sqllang.dll which seems to have two roles, firstly it retrieves the actual characters from the string and then stores the tree, it adds operations to the tree by adding new step into the tree.
The batch parser uses a model based on yacc and lex, the internal functions are yyparse and yylex, yyparse is particularly long and probably generated using yacc or what was once yacc but is now, perhaps, heavily customized.
What are yacc and lex?
Aside from sounding like a weird cow like animal that is particularly hard to shear and a character from superman, yacc and lex are basically the most widely used tools by compilers to understand the text that they are given. It is often easy to forget that Sql Server is, amongst other things, a compiler, well it is! Sql has to do something with the text that it receives and yacc and lex are the answer.
Lex breaks the text into tokens, it has a set of tokens it is trained to find, so for example it will know that “select”, “insert”, “with”, “for” and “1980” are all tokens and they have a specific type.
Yacc understands when and where certain tokens can be used, for example “NATIVE_COMPILATION” is a token that lex can parse and returns to yacc that it it a valid token but it is up to yacc to decide whether it is valid after a “SELECT” or only valid after a stored procedure definition on an instance of Sql Server where Hekaton is enabled (i.e. 2014 enterprise edition).
Typically with yacc, what happens is you have a grammar definition that specifies the tokens and where they are valid and then you use yacc to compile that garmmer into actual source code which is essentially a giant state machine which parses queries into a tree. To give some idea about size for this, I believe that the main switch statement in Sql 2014 has 3534 entries in it, this is far to large to be managed sensibly by developers so I really hope that the query parsing team don’t have to manage it manually!
The process for decoding a string is that yyparse which is the state machine built by taking the T-Sql grammar file and compiling it by yacc keeps calling yylex which returns the type of token that it has found until it reaches the end of the string at which point it returns -1.
While yyparse retrieves valid tokens it takes action according to the type of token so for example, the token returned by yylex is a SELECT token then yyparse adds a step into the tree held by the CParser to do a select and adds in all the values needed such as whether there is a table and what the table name is.
What makes the whole process so quick is that firstly that it is basically one big while loop wrapping a case statement which uses only a minimum amount of instructions to parse the text and it makes only a single parse through the query, it isn’t pretty and probably isn’t very easy to write unit tests for but it is fast and does the job. It certainly has to as it is used every time any text is sent to Sql Server to execute, whether valid or invalid.
How was this info gathered
Although I would love to have access to the Sql Server source code, I don’t, this was gathered using a debugger (windbg / cdb), the public symbols and a real thirst for understanding how Sql Server parsed the strange queries that is sent so accurately, quickly and consistently. The books on Sql Server internals seem to leave off the parser which is a shame as it is so core to everything that Sql does, I guess because it is so reliable there is no need to bother documenting what it does or how it works.
In this post I want to talk about the headers that are included in most stored procedures and because I often see requirements for writing T-Sql code that include something along the lines of “all procedures MUST be well commented and adorned with a template that lists each change”, sample procedures include templates such as:
These are all taken from stack overflow: http://stackoverflow.com/questions/1085672/sql-comment-header-examples
These templates all impart a sense of professionalism and care has been taken to write correct comments and we all know that dutifully ensuring comments are correct help show how well the database code has been written, except that is rarely the case. The quality of comments bears no relationship to the quality of code, in fact code is compiled by a compiler so incorrect code can be detected, incorrect comments are ignored by the compiler just as much as correct comments.
There is one section in particular that is a problem, the change history – if you use source control you have a history of changes which can be validated and checked, however if instead you rely on the comments then you are relying completely on developers updating the comments correctly which is unlikely, either due to time constraints (i’ll fix this now and update comments later!) or potentially maliciously. If you then consider that the version history could possibly be incorrect then it can’t be trusted and if it can’t be trusted then it is irrelevant and is a waste of time.
Most templates include the name of the stored procedure, in case you don’t know how to find the name of each stored procedure is at the top of script, either after the word proc or procedure, it is sometimes surrounded by [ and ] or “‘s, if the name is in the comments then it might be after something like “Name:”, “Nombre: “, “Monkier:”, “Namesake: “, “Stage Name:” or something more exotic:
So if we use source control (which we should) then if we take the above we have a verifiable history of who the author is (they checked in the code), when the create date is (the date it was checked in), the history which includes both comments and actual diff’s can also be taken from source control.
The above code could have been written as:
Not everything is useless, there is a real case for showing example usages, especially when there are multiple parameters passed in and it is good to see what possible parameters can be passed in but I think that is about it. You may also wish to include a copyright banner in amongst the header comments which can be included to keep lawyers happy and can be happily ignored by everyone else.
I have come up with a list of common things I see in headers and how they should be handled:
|Name||Script Header||It is already in an obvious place, if the editor has colouring (everything other than notepad) then it is nicely highlighted|
|Author||Source control history||It is verified and easily retrievable|
|Create Date||Source control history||It is verified and easily retrievable|
|Change History||Source control history||It is verified and easily retrievable|
|Description||The stored procedure name||If the name of the stored procedure does not accurately describe the action that it does then rename it!|
In another post I would like to talk about comments within stored procedures but I think that is enough for now!
I generally break my unit tests for T-Sql into two broad categories, the first is the schema, things like foreign keys and defaults and the second is the actual code, stored procedures and functions (triggers as well but I do my hardest not to write triggers on account of them being so awful.)
When I look at what tests to write for constraints, I ask myself the question, as I do any time I write anything “what is the point of this, do I need to write it?” and I normally come back with the answer (it is a lot of fun in my head, it is like a regular party with everyone super excited about sql server) “if there is no point in writing the tests, is there any point in the original code”, I then think whether the constraint is necessary and adds anything, if it does then it stays and I write tests for it.
I write tests for the schema as a schema is code, it might be dml instead of ddl but you better believe it is possible to get something wrong and for someone to delete something by mistake and most importantly that I or someone else will want to refactor it later and knowing what will break when a change is made is extremely important.
To write the actual tests I use tSQLt as my preferred framework and I fake the child table and the reference table and use tSQLt.ApplyConstraint to test the actual constraint. If we take this simple schema, two tables and a foreign key, if we then look at how we test the constraint, what do we care about:
– I care whether the key is defined correctly so will accept values in the child table that are in the parent table
– I care whether the key stops incorrect values going into the child table
I don’t really care if the foreign key allows all the possible values, that is Sql Servers job, I just want to ensure it does what it is I have told it to do.
To write tests, I create a new test class:
exec tSQLt.NewTestClass ‘constraint on users.users gender_id’
I could have one class per table or one per database but there isn’t much of an overhead having lots of classes and it helps to keep them separate and organised so I create one class per constraint.
Once I have the test class, I create the setup routine that will be used by my test “methods” (procedures). The Setup procedure is run before each test and it means you don’t have to keep doing the same thing so it is less rrepetitive and less error prone, remember kids no one is testing the tests!
create procedure [constraint on users.users].[SetUp]
exec tSQLt.FakeTable ‘gender’, ‘users’
exec tSQLt.FakeTable ‘users’, ‘users’
insert into users.gender(gender_id, gender_description)
select 1, ‘Male’
exec tSQLt.ApplyConstraint ‘users’, ‘FK_users_gender_id_genders’, ‘users’
In this case, I fake out the two tables involved in the relationship and in the parent table I add a single row. I could add every possible row and then check they work but I believe that is overkill and while it might be straightforward for gender, it is a little more involved for products that have ever been sold at a major supermarket or retailer. With the data added I then ensure that the constraint is applied and ready to allow good data to go in and block bad data and keep it out.
I then have my “happy path” test which tests that I can add data to the child table:
create procedure [constraint on users.users].[test allows male]
insert into users.users(user_id, gender_id)
select 100, 1
I typically always expect this to work, but it I have setup the foreign key to point to another column by accident then the insert will fail as in my test setup I only inserted a valid value into the gender_id column in the genders table.
Once I have the good path that should work if I have set everything up correctly, I then look to the unhappy path, that is what happens when something goes wrong – if I try to insert data that is incorrect, does it fail? When it fails, does it fail in the way I expect?
create procedure [constraint on users.users].[test disallows unknown values]
exec tSQLt.ExpectException @ExpectedMessagePattern = ‘%INSERT%conflicted%FOREIGN KEY%’
insert into users.users(user_id, gender_id)
select 100, 94898
So here all I am doing is saying that I expect to get an exception somewhere in this procedure and when I do the error is going to match the pattern “%INSERT%conflicted%FOREIGN KEY%” if it doesn’t then the test will fail. I then do the insert with a value I know not to exist (remember I am faking both tables in the Setup so know exactly what values are in each table.
Finally I need to run the tests using either “exec tSQLt.RunAll” or “exec tSQLt.RunTest ‘constraint on users.users’
The tests will run and either pass or fail, to do TDD properly you should create and run the tests, the first one should always complete so it is hard to get the red before the green but the second test should fail before the foreign key is put in, when it is added then the test will succeed as there will be an error message.
I personally think that this finds the right balance between writing tests for the sake of it (1 test for every possible value in the reference table!) and writing enough to know that the constraint is in place and it is doing what it should do – irrespective of whether the values in the table are actually correct.
Like most things in life and in IT the amount and type of tests depends on the type of application and how the application is being used, a test database for me to play around with sql doesn’t really need unit testing but a database that is critical in a medical EMR application should be well tested.
The code that goes along with this is:
I was watching this TechEd (I have always loved that name!) video from 2011 about the new way of doing database development, in a modern way – Sql Server Data Tools or SSDT:
It got me thinking how far we have come in the last 3 years and I really feel strongly that as T-Sql developers we have tools and practices we can follow to make our code maintainable, reliable and enjoyable!
We can build systems which are well tested which brings with it a well defined host of benefits such as:
- Ability to refactor T-Sql code & data schema
- Confidence that T-Sql code is valid and works as expected
We can concentrate on designing schemas and writing code rather than writing scripts to upgrade schema’s when we deploy using:
- DacPacs + SqlPackage.exe
- Redgate Sql compare
- ApexSql Diff
We can ensure that our code is safe and we can delete unused code because we have source control, my preferred is Git but anything will do, heck even the much derided visual source safe would be an improvement over most T-Sql code projects!
We can refactor code and schemas using SSDT to rename and move objects, we can make changes and build a project and get instant feedback about whether we have broken anything.
We can improve the speed of our development using intellisense and if we are lucky enough to have resharper we can just straight to objects we know the name of (ctrl+t with resharper shortcuts).
We can find out what references objects effectively and find unused objects etc.
We can build a continuous integration pipeline, we can check code in and have a build system like TeamCity checkout the code, build the project, deploy the project and tests and then run the tests – there are a few separate parts to set up here but it is easily doable.
To some extent we can debug T-Sql, this is an area that definitely needs improving but at least we can step through and see the values of variables – until we can visualise what is in tables I think T-Sql debugging will always be lacking.
I am really passionate about modern T-Sql development, please take a look at how you develop T-Sql code and think about whether you are using modern techniques to develop code and whether you life would be any better if you started employing some modern T-Sql development techniques.
In looking around for other resources for modern T-Sql development I was unable to find that much so thought it would be great to start a linked-in group to start collating and sharing best practices, if you are interested please head over to the new “Modern T-Sql Development group on linkedin: