anaging large amounts of data is always a challenge. Several major database vendors claim that their database engines are ready for terabytes of data, which is true to a certain extent. However, when those terabytes are not static data, but live information that people update constantly, search on, report from, and run web applications against, and when the number of concurrent users accessing the data peaks significantly as well, providing fast, unblocked access to those terabytes becomes extremely challenging.

[@more@]

It would be simple if improving the hardware solved the problem, but unfortunately increasing hardware horsepower on its own in this case doesn't buy a lot for end users. The simple truth is that the end user perception of performance often comes down to a single database record, which several processes may be trying to update and several others read at the same time. Performance and scalability often require special design tricks on the database side—and very often such tricks are database-engine specific.

From an architectural point of view, there are several ways to improve performance and scalability of database-driven applications, and you must usually implement them all to achieve the acceptable results. One technique is to change the database architecture to improve database performance in certain areas. Such changes include database splits, data normalization and de-normalization, table splits (horizontally and vertically), adding indexes and constraints, and other similar techniques. I won't go into detail on these in this article.

Sometimes however, one requirement—for example a need for fast updates—contradicts other requirements, such as a need for fast searches and real-time reporting. The latter requirements mean that you're not free to change the architecture of the database significantly; it must be preserved in a certain stable state that satisfies all the requirements to the greatest possible extent. If at the same time, you need to run a complex process on the database side, that must perform and scale well, you must usually design it as carefully crafted database stored procedure.

There are several reasons to write stored procedures. First, from the logical point of view, if the back-end logic can be divided into business logic and data logic, the latter should naturally reside on the database side, leading to cleaner design and better maintainability. Secondly, database engines compile stored procedures and save execution plans along with them, which improves performance "for free" (from the developer's point of view). Finally, and most importantly for this discussion, placing data manipulation logic into the stored procedures lets you use several approaches and tricks that can improve their performance and scalability significantly.

Some of these approaches are global, obvious, and easy to follow. Others may be specific to a particular database engine. Several approaches require non-standard, task-specific design and ideas to implement properly—and this is where the true art of database design comes to play. All the database design-specific approaches, however, usually end up with more complex script than improper design, and that's the major reason why they are often not followed.

In this article, I'll share a few tips that I have personally found to be the most important when developing SQL Server stored procedues. These tricks reflect both the outcome of my mano-a-mano fights with the database to achieve better performance, and a summary of the experience of the extremely talented software developers I've had the privilege to work with. I'll present the tips, and then illustrate them by the stored procedure-based SUDOKU solution I have designed for this purpose.

Tips for Writing High-performance SQL
These tips apply broadly when writing high-performance stored procedures. Unfortunately, unlike some tips, you can't simply apply most of them without first considering the nature and schema of the data you're querying.

Avoid using cursors (as well as other looping structures) as much as possible. Cursors are inefficient, and database engines usually don't have the best loop implementations in terms of performance.

On the database side, you can usually replace code involving cursors with aggregate SQL statements (SELECT, INSERT, and UPDATE) that use vector tables. All database engines are heavily optimized for aggregate statements, so even if a loop is unavoidable, it is always better to execute a few aggregate statements in a loop with a small number of iterations, than to create a cursor and execute simple statements over a large number of iterations.

Even if initial performance tests, especially with a small amount of data, show cursors to be more efficient than a complex aggregate statement, it is worthwhile to try to optimize the operation by breaking it into smaller portions or using other approaches—unless you can guarantee that the data value will stay small. Cursor approaches will not scale.

Filter data wisely. One alternative to using cursors uses a fall-through approach, filtering and aggregating data in multiple steps via a set of data storages, which could be physical tables, temporary tables, or table variables. It is usually best to include some aggregate filters into aggregate statements to filter out the majority of data in one simple shot whenever necessary, working on smaller amounts of data. Then you can proceed with joining and filtering, making sure to keep the number of join permutations under control at all times.

It is usually more efficient to execute multiple statements with one condition than a single statement with multiple OR conditions when executing UPDATE and DELETE statements against permanent database tables that can be accessed by multiple users simultaneously. This tip is especially important from the scalability point of view; from the performance point of view the difference is usually marginal. The major reason for the tip is the locking of the database records and the lock escalations that occur behind the scenes.

Make wise distinctions between temp tables and table variables. Table variables are in-memory structures that may work from 2-100 times faster than temp tables. But keep in mind that access to table variables gets slower as the volume of data they contain grows. At some point, table variables will overflow the available memory and that kills the performance. Therefore, use table variables only when their data content is guaranteed not to grow unpredictably; the breaking size is around several thousand records. For larger data volumes, I recommend temp tables with clustered indexes. Interestingly, I've found that a temp table with one clustered index is often faster than having multiple simple indexes. In contrast, multiple simple indexes with physical tables are often faster than one clustered index.

Make careful distinctions between hard rules and assumptions. This is more of a business design tip, which applies more to code design than to performance and scalability design in general. In real life however, performance and scalability are generally the first things to suffer from improper design. When rules are implemented as assumptions, they usually cause unnecessary calculations to be performed, affecting performance. However, when assumptions are implemented as rules they tend to cause errors and algorithm failures, which usually requires an urgent redesign. That, in turn, is usually performed with business constraints and results in inefficient final algorithms. That's because bad design decisions are often corrected in a rush and without sufficient resources—sometimes under pressure from customers whose businesses are usually in a critical stage when problems are uncovered, but must continue operating during the process.

Pay attention to join order. Using proper join order sometimes lets the database engine generate hints that execute joins with an optimal amount of records. Most database engines also support hard hints, but in most cases you should avoid using hard hints and let the database engine figure out the best way to do its job on its own.

Be careful when joining complex views to other views and database tables in complex SELECT statements. When the database contains a significant amount of data, SQL Server engine tends to recalculate the execution plan of the resulting statement, which often results in an inefficient execution plan and may kill the performance. The most difficult part is that the behavior of SQL Server engine is inconsistent in that respect, and heavily depends on the database size, indexes, foreign keys, and other database structures and constraints. The consistent work-around is to pre-select data from the view into a temp table with the reasonable pre-filters, and then use that temp table in place of the underlying view.

Create indexes on temp tables wisely. As mentioned in Tip 4, clustered indexes are usually the best in terms of performance for temp tables; however, there is a difference between creating the index before or after inserting data into the temp table. Creating the index before the insert complicates the insert, because the database engine must order the selection. For complex selections such as those mentioned in Tip 7, the extra ordering may overcomplicate the overall statement and drastically degrade the performance. On the other hand, creating the index after the insert forces the database engine to recalculate the execution plan of the stored procedure every time it is called. Therefore, the decision is always a trade-off and you should make it based on the relative costs of the two possibilities.

In general, try to avoid execution plan recalculation. One common cause of recalculation occurs when the stored procedure contains several paths that depend on values passed in parameters. However, whether avoiding recalculation is possible depends on the complexity of the stored procedure and on other circumstances, such as those described in tip 8. When the engine does recalculate execution, performance always suffers; however, recalculating the execution plan of the caller does not force the execution plan recalculation of the called procedure (or view or function). Therefore, the workaround is to divide one stored procedure into multiple procedures (depending on the passed-in parameters), and then call the children from the parent conditionally. You should perform this subdivision very carefully though, because it can be a maintenance nightmare—but sometimes it seems to be the only way to achieve acceptable database performance and scalability.

Finally, although this isn't either a performance or a scalability tip, I urge you to format your stored procedure scripts legibly. It's best to agree on common practices such as clause order and formatting rules with your coworkers in advance. Not only does that help avoid errors, it also clearly shows the logical structure of the statements and often aids in figuring out faulty filters and joins.

This list of tips is certainly not exhaustive, but they probably cover the most important performance and scalability factors. Still, there's nothing like an example to drive home the point. The Sudoku solution described in the rest of this article illustrates the techniques in the first six tips.

The Sudoku Solution
I decided to craft this stored-procedure based Sudoku solution example for multiple reasons. First, Sudoku uses a 9x9 table, so thinking of this table as a database table is pretty natural. Second, the database solution has a relatively simple structure, while at the same time—if used properly—nicely represents a high-performance implementation of the solution, using the first several tips discussed in the preceding section. Finally, because it's flexible, a database solution lets you look at the puzzle itself from a generic perspective, showing ways you can extend the puzzle solution to more generic tasks.

The solution consists of two database tables: one that stores the initial puzzle data, and a second that stores calculation results, the script(s) to fill the tables, and the stored procedure that calculates the results. You can download the sample code to experiment with the solution yourself.

The TEMP_GS_V1.sql script contains the table definitions. In the first table, each record represents a Sudoku cell. In the second table, each record represents the entire solution. The reason for structuring the tables in this manner will become clearer as you walk through the solution; but keep in mind that it's important that the structure be both flexible and avoid intermixing rules with assumptions (see Tip 5).


Figure 1. Unsolved A1Escargot Puzzle: The figure shows A1Escargot puzzle in its initial unsolved state.

Sudoku's solution rules state that each row, column, and 3x3 square must contain a unique set of digits from 1 to 9. The assumption is that given the initial set of numbers, there is only one resulting solution; otherwise the puzzle is considered invalid. This assumption, if transferred to real-life applications, sounds unreasonable. There is no way that developers can ensure a valid data state; therefore, I wanted my Sudoku solution to reflect that—to work properly whether the puzzle is valid or not—and for invalid puzzles, simply figure out all the possible solutions.

I have not created an interface for data entry (to create puzzles), because that's outside of the solution scope, so you have to enter data into the TEMP_GS table using a script. The Fill_GS.sql script contains the data entry example, which corresponds to a very complex Sudoku puzzle called "AlEscargot," which was once claimed to be the most complex SUDOKU puzzle ever. Figure 1 shows the puzzle in its initial unsolved state .

The CALCULATE_GS stored procedure (see the CALCULATE_GS_V3.sql script) contains the solution algorithm, which contains a set of steps that illustrate the aforementioned SQL Server tips. Note that the CALCULATE_GS procedure accepts a Puzzle Number parameter. This is just a habit to get into when you're creating solutions suitable for multi-user environments: if puzzles have distinct IDs, then multiple users can run the stored procedure at the same time, using the same physical tables.

The CALCULATE_GS stored procedure contains the following steps:

Create a table variable, filling it with numbers from 1 to 9. This table variable illustrates Tip 4, because there are not that many records in the table, and it is heavily used within the stored procedure, sometimes joining to itself three times. In such cases, using a table variable is far more efficient than using a temp table. The procedure fills in the TEMP_GS table with all possible rows that correspond to possible cells using two operations:
First, it calculates singletons—the cells which can contain only one possible value to meet the Sudoku rules. This evaluation gets performed multiple times, because new cells may become singletons as a result of previous calculations. This operation does not use a cursor; instead, it executes aggregate statements until singletons exist, which minimizes the number of possible statements. This approach illustrates Tip 1.
Second, when no more singletons exist, the rest of the cells are filled with multiple possible digits, again in one operation (see tip 1).
Next, the procedure creates and fills in a temporary table that corresponds to possible Sudoku rows. In this case, because the number of records is unpredictable in complex scenarios, it's best to use a temp table. It's worth noting that solving the solution using table variables may be faster in many instances. In real life, such situations may necessitate two separate stored procedures—one optimized for low volumes, and another for high volumes. The creation of the intermediate table over the aggregation steps illustrates Tip 2. This table eliminates a lot of invalid combinations based on a very simple aggregation condition: the sum of cells within each row must be equal to 45. This condition is simple enough (compared to the requirement to have unique digits within each row) to not slow down the aggregate statement, yet powerful enough to eliminate the majority of invalid combinations in a single step.
Create and fill in a temporary table that corresponds to possible Sudoku columns, following the same paradigm as in step 2. This approach illustrates Tip 2: even though logically you need only the table that corresponds to valid rows, joining it to itself at a later point may result internally in billions of combinations after the joins and before filtering, which kills performance. In this and the next step, creating the extra tables to join lets you eliminate the possibility of too many intermediate combinations during joins.
Create and fill in a temporary table that corresponds to possible Sudoku squares, following the same paradigm as in steps 2 and 3.
Delete the records that do not adhere to Sudoku's rules from the three temp tables created in steps 2- 4. The tables were pre-filtered using aggregate conditions when the records were inserted into them, so now deleting the extra rows using precise conditions is fast because it operates on a small subset of data. In addition, deleting records separately based on each violation is faster than using an OR clause (the violations are represented by the rows where any value is repeated in any pair of columns). I combined all the delete statements into one loop using dynamic SQL to save some space. Because the statements are simple and don't require complex execution plans, using dynamic SQL does not affect performance much. However, if the statements were complex, it might be wiser to avoid dynamic SQL. This step illustrates Tip 3, though as explained earlier, it is not terribly important for temp tables—it's far more important for permanent database tables.
After completing step 5, all three temp tables contain only valid Sudoku structures (rows, columns, and squares); therefore, simply joining them must result in valid solutions (note that for some puzzles, multiple solutions are possible). Joining operations are always fast, as long as the joined tables are small or indexed, and as long as the intermediate steps do not result in too many rows. The database engine is usually smart in optimizing the joins; however, specifying the proper join order helps the engine sometimes. Therefore, this step joins the three tables in palette order, starting with a row, then a column, then another row, another column, etc., involving squares at some point, so that each extra join will potentially result in the minimal number of virtual rows. The result of the join is inserted into the TEMP_GS_RESULT table, which contains the solutions, as mentioned earlier.

For puzzles that have only one solution, the FinalSelect.sql script selects from the TEMP_GS_RESULT table and outputs the solution in tabular form. The represented solution is reasonably fast. It solves the "AlEscargot" puzzle from Figure 1 in five minutes and twenty seconds on a 2.4 GHz machine with 1 GB of RAM. The solution (output in text form) is shown below:

   1    6    2    8    5    7    4    9    3

   

   5    3    4    1    2    9    6    7    8

   

   7    8    9    6    4    3    5    2    1

   

   4    7    5    3    1    2    9    8    6

   

   9    1    3    5    8    6    7    4    2

   

   6    2    8    7    9    4    1    3    5

   

   3    5    6    4    7    8    2    1    9

   

   2    4    1    9    3    5    8    6    7

   

   8    9    7    2    6    1    3    5    4

Apart from the efficiency however, the solution, being both database-based and properly designed, represents the following additional advantages:

It does not assume that puzzles have only a single solution; it finds all the possible solutions if more than one is available.
It has a very clean structure that prevents mistakes.

The clean structure provides clear paths to extensibility. For example, because all the steps and joins are pretty obvious, it's very easy to rewrite the solution using dynamic SQL, extending it to higher dimensions. The standard Sudoku puzzle is based on a 32 x 32 square. Using a similar clean stored procedure structure and dynamic SQL, it is very easy to extend the solution to solve N2 x N2 square puzzles—and the resulting solution is guaranteed to perform well. However, for big Ns you should redesign the final result table to be cell-based rather than solution based, otherwise it may easily exceed SQL Server's column number and row length limitations. That's a rather simple enhancement though, and doesn't affect the efficiency of the major algorithm. In other words, the solution is scalable, and extended versions of the solution will be scalable as well, and that clearly differentiates it from the other possible solutions.

Designing High Performance Stored Procedures