Thursday, March 24, 2016

It works better when you do the basics right...

So, I was doing some tuning as usual, and I always report my findings in writing so I can justify the hours I put in. In one of these reports, I ended up identifying, in one of the company's applications, tables using VARCHAR(200) fields as primary keys.

OK, I was a developer once, and I know that development can sometimes be a messy business. You have little time (sometimes no time), few resources, almost no documentation, and so on.

I was asked if there was actually any proof that integer, self-generating columns were a better deal for PKs.

So here is what I came up with: I created a two-table sample database, first using a VARCHAR(250) column as the PK, and created a FOREIGN KEY relationship between the tables. I measured the time it took to insert 100,000 rows, the I/O statistics and elapsed time of a simple SELECT, and the index sizes. The test was done on SQL Server 2014.

Script to create and load the tables using VARCHAR as the PK:

DROP TABLE customerdetails; 

DROP TABLE customer; 

CREATE TABLE customer 
  ( 
     id        VARCHAR(250) PRIMARY KEY, 
     NAME      VARCHAR(100), 
     birthdate DATETIME, 
     mmo       VARCHAR(max) 
  ) 

go 

CREATE TABLE customerdetails 
  ( 
     customerid VARCHAR(250) REFERENCES customer(id), 
     detaildate DATETIME, 
     mmo        VARCHAR(max), 
     auditvalue BIT 
  ) 

go 

DECLARE @dtinicio DATETIME = Getdate();  -- start time, used to measure the total load duration
DECLARE @cont INT = 0;                   -- loop counter
DECLARE @str VARCHAR(250);               -- counter converted to text, used to build the padded key

-- insert 100,000 customers, each with 3 detail rows; the key is @cont left-padded with zeros to 250 characters
WHILE ( @cont < 100000 ) 
  BEGIN 
      SET @str=Cast(@cont AS VARCHAR); 

      INSERT INTO customer 
                  (id, 
                   NAME, 
                   birthdate, 
                   mmo) 
      SELECT CASE 
               WHEN Len(@str) < 250 THEN Replicate('0', 250-Len(@str)) + @str 
               ELSE @str 
             END, 
             'Customer ' + @str, 
             Getdate() + @cont, 
             CASE WHEN Len(@str) < 250 THEN Replicate('0', 250-Len(@str))+@str 
             ELSE 
             @str END 
             + 'Customer ' + @str; 

      INSERT INTO customerdetails 
                  (customerid, 
                   detaildate, 
                   mmo, 
                   auditvalue) 
      SELECT CASE 
               WHEN Len(@str) < 250 THEN Replicate('0', 250-Len(@str)) + @str 
               ELSE @str 
             END, 
             Getdate(), 
             CASE 
               WHEN Len(@str) < 250 THEN Replicate('0', 250-Len(@str)) + @str 
               ELSE @str 
             END, 
             1; 

      INSERT INTO customerdetails 
                  (customerid, 
                   detaildate, 
                   mmo, 
                   auditvalue) 
      SELECT CASE 
               WHEN Len(@str) < 250 THEN Replicate('0', 250-Len(@str)) + @str 
               ELSE @str 
             END, 
             Getdate(), 
             CASE 
               WHEN Len(@str) < 250 THEN Replicate('0', 250-Len(@str)) + @str 
               ELSE @str 
             END, 
             1; 

      INSERT INTO customerdetails 
                  (customerid, 
                   detaildate, 
                   mmo, 
                   auditvalue) 
      SELECT CASE 
               WHEN Len(@str) < 250 THEN Replicate('0', 250-Len(@str)) + @str 
               ELSE @str 
             END, 
             Getdate(), 
             CASE 
               WHEN Len(@str) < 250 THEN Replicate('0', 250-Len(@str)) + @str 
               ELSE @str 
             END, 
             1; 

      SET @cont=@cont + 1; 
  END; 

SELECT Datediff(millisecond, @dtinicio, Getdate()) AS 
       [Time to insert 100,000 rows (ms)]; 
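For reference, this is what the generated keys look like: @cont converted to text and left-padded with zeros so every key is exactly 250 characters long. A quick way to visualize one of them (just an illustration, not part of the test):

DECLARE @str VARCHAR(250) = Cast(42 AS VARCHAR); 

SELECT Replicate('0', 250 - Len(@str)) + @str        AS paddedkey,  -- '000...042' 
       Len(Replicate('0', 250 - Len(@str)) + @str)   AS keylength;  -- always 250 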



Script to create and load the same tables using an INT IDENTITY PK:

DROP TABLE customerdetails; 

DROP TABLE customer; 

CREATE TABLE customer 
  ( 
     id        INT IDENTITY(1, 1) PRIMARY KEY, 
     NAME      VARCHAR(100), 
     birthdate DATETIME, 
     mmo       VARCHAR(max) 
  ) 

go 

CREATE TABLE customerdetails 
  ( 
     customerid INT REFERENCES customer(id), 
     detaildate DATETIME, 
     mmo        VARCHAR(max), 
     auditvalue BIT 
  ) 

go 

DECLARE @dtinicio DATETIME = Getdate(); 
DECLARE @cont INT = 0; 
DECLARE @id INT = 0; 
DECLARE @str VARCHAR(250) 

WHILE ( @cont < 100000 ) 
  BEGIN 
      SET @str=Cast(@cont AS VARCHAR); 

      INSERT INTO customer 
                  (NAME, 
                   birthdate, 
                   mmo) 
      SELECT 'Customer ' + @str, 
             Getdate() + @cont, 
             CASE WHEN Len(@str) < 250 THEN Replicate('0', 250-Len(@str))+@str 
             ELSE 
             @str END 
             + 'Customer ' + @str; 

      SELECT @id = @@IDENTITY;  -- no triggers involved here, so this behaves like SCOPE_IDENTITY() 

      INSERT INTO customerdetails 
                  (customerid, 
                   detaildate, 
                   mmo, 
                   auditvalue) 
      SELECT @id, 
             Getdate(), 
             CASE 
               WHEN Len(@str) < 250 THEN Replicate('0', 250-Len(@str)) + @str 
               ELSE @str 
             END, 
             1; 

      INSERT INTO customerdetails 
                  (customerid, 
                   detaildate, 
                   mmo, 
                   auditvalue) 
      SELECT @id, 
             Getdate(), 
             CASE 
               WHEN Len(@str) < 250 THEN Replicate('0', 250-Len(@str)) + @str 
               ELSE @str 
             END, 
             1; 

      INSERT INTO customerdetails 
                  (customerid, 
                   detaildate, 
                   mmo, 
                   auditvalue) 
      SELECT @id, 
             Getdate(), 
             CASE 
               WHEN Len(@str) < 250 THEN Replicate('0', 250-Len(@str)) + @str 
               ELSE @str 
             END, 
             1; 

      SET @cont=@cont + 1; 
  END; 

SELECT Datediff(millisecond, @dtinicio, Getdate()) AS 
       [Time to insert 100,000 rows (ms)]; 


1. The first thing that drew my attention: the time it took to load the data.


  • Using VARCHAR: 82,996 ms
  • Using INT IDENTITY: 60,796 ms

Wow... that's a big difference if you are considering bulk loading data into your database.
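Part of that difference is simply key width: every key value is written into the clustered index and repeated in the foreign key column of every detail row, and a zero-padded VARCHAR(250) key is much wider than an INT. A quick sanity check of the two widths (an illustration, not part of the test scripts):

SELECT Datalength(Cast(1234 AS INT))              AS intkeybytes,      -- 4 bytes 
       Datalength(Replicate('0', 246) + '1234')   AS varcharkeybytes;  -- 250 bytes 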

2. The second thing I measured: I/O statistics and elapsed time for a simple SELECT on these tables. I used this query for testing:


SET statistics io ON; 

SELECT * 
FROM   customer c 
       JOIN customerdetails cd 
         ON ( c.id = cd.customerid ); 

SET statistics io OFF; 
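
The script above only turns on STATISTICS IO; the elapsed times below were taken from the client. If you also want SQL Server to report CPU and elapsed time, the same query can be wrapped with STATISTICS TIME as well (an assumption on my part, not what was originally run):

SET statistics time ON; 

-- ... same SELECT as above ... 

SET statistics time OFF; 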



  • These were the results using VARCHAR:

Ten seconds to execute the query.

Table 'Customer'. Scan count 0, logical reads 1218759, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'CustomerDetails'. Scan count 1, logical reads 23085, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

ObjectName        IndexDescription   IndexSize(MB)
Customer          CLUSTERED INDEX    54.4
CustomerDetails   HEAP               149.3

  • Using INT IDENTITY:

Five seconds to execute the query.

Table 'Customer'. Scan count 0, logical reads 918759, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'CustomerDetails'. Scan count 1, logical reads 10715, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

ObjectName        IndexDescription   IndexSize(MB)
Customer          CLUSTERED INDEX    28.9
CustomerDetails   HEAP               78.4


Numbers don't lie, folks.


Enough said from me... what is your opinion?


Check out my new book about the R language: http://www.amazon.com/dp/B00SX6WA06

One comment:

  1. This is the query I used to measure index size ...

    SELECT [DatabaseName]
    ,[ObjectId]
    ,[ObjectName]
    ,[IndexId]
    ,[IndexDescription]
    ,CONVERT(DECIMAL(16, 1), (SUM([avg_record_size_in_bytes] * [record_count]) / (1024.0 * 1024))) AS [IndexSize(MB)]
    --,[lastupdated] AS [StatisticLastUpdated]
    --,[AvgFragmentationInPercent]
    FROM (
    SELECT DISTINCT DB_Name(Database_id) AS 'DatabaseName'
    ,OBJECT_ID AS ObjectId
    ,Object_Name(Object_id) AS ObjectName
    ,Index_ID AS IndexId
    ,Index_Type_Desc AS IndexDescription
    ,avg_record_size_in_bytes
    ,record_count
    ,STATS_DATE(object_id, index_id) AS 'lastupdated'
    ,CONVERT([varchar](512), round(Avg_Fragmentation_In_Percent, 3)) AS 'AvgFragmentationInPercent'
    FROM sys.dm_db_index_physical_stats(db_id(), NULL, NULL, NULL, 'detailed')
    WHERE OBJECT_ID IS NOT NULL
    AND Avg_Fragmentation_In_Percent <> 0
    ) T
    GROUP BY [DatabaseName]
    ,[ObjectId]
    ,[ObjectName]
    ,[IndexId]
    ,[IndexDescription]
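
    If you only want the two test tables in the output, the inner query's WHERE clause can be extended, for example (a hypothetical tweak, not part of the original comment):

    WHERE OBJECT_ID IS NOT NULL
    AND Avg_Fragmentation_In_Percent <> 0
    AND Object_Name(object_id) IN ('customer', 'customerdetails')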

