quinta-feira, 15 de fevereiro de 2018

Best way to count words in a string using T-SQL

On this recent project, I needed to count words in a string, in order to measure its general "quality", for a Data Quality project.

Sounded simple enough, but this is a massive table with milions of rows so performance is a requirement.

I created I function of my own, using this answer on StackOverBlow: https://stackoverflow.com/questions/15352990/get-word-count-of-a-column-using-sql

CREATE FUNCTION [dbo].[Wordcount] (@InputString VARCHAR(4000)) 
returns INT 
AS 
  BEGIN 
      DECLARE @Index INT 
      DECLARE @Char CHAR(1) 
      DECLARE @PrevChar CHAR(1) 
      DECLARE @WordCount INT 

      SET @Index = 1 
      SET @WordCount = 0 

      WHILE @Index <= Len(@InputString) 
        BEGIN 
            SET @Char = Substring(@InputString, @Index, 1) 
            SET @PrevChar = CASE 
                              WHEN @Index = 1 THEN ' ' 
                              ELSE Substring(@InputString, @Index - 1, 1) 
                            END 

            IF @PrevChar = ' ' 
               AND @Char != ' ' 
              SET @WordCount = @WordCount + 1 

            SET @Index = @Index + 1 
        END 

      RETURN @WordCount 
  END 

go 

Althought it is a correct answer to the problem, it's not performactly good. So after doing some more googling, I found this answer much more simpler: https://stackoverflow.com/questions/41952250/sql-string-counting-words-inside-a-string

INSERT INTO @t 
VALUES      ('I am not your father.'), 
            ('Where are your brother,sister,mother?'), 
            ('Where are your brother, sister and mother?'), 
            ('Who are     you?'), 
            (''); 

WITH cte 
     AS (SELECT id, 
                Replace(Replace(string, ' ', '><'), '<>', '') string 
         FROM   @t) 
SELECT id, 
       Len(' ' + string) - Len(Replace(string, '><', ' ')) CountWords 
FROM   cte 


After tweaking it to handle null and empty values, my function looked like this:

CREATE FUNCTION [dbo].[Fn_wordcount] (@InputString VARCHAR(4000)) 
returns INT 
WITH schemabinding 
AS 
  BEGIN 
      DECLARE @WordCount INT = 0; 

      IF @InputString IS NULL 
        RETURN 0 

      SELECT @InputString = Replace(Replace(@InputString, ' ', '><'), '<>', ''); 

      SELECT @WordCount = Len(' ' + @InputString) - Len( 
                          Replace(@InputString, '><', ' ')); 

      SELECT @WordCount = Isnull(@WordCount, 0); 

      RETURN @WordCount 
  END 

go 

I'm still observing the performance, the the first run looked ok. I will get back on this post after properly measuring a full load.

Check out my new book about R Language http://www.amazon.com/dp/B00SX6WA06

Nenhum comentário:

Postar um comentário

Leave your comment here!