quinta-feira, 17 de maio de 2018

Converting a ASC character column to UTF-8

On a recent task, I had to import a simple text file for data enrichment purposes, and the "name" column when imported to SQL SERVER, got messed up, event after I changed the file encoding to UTF-8.

So basically, the data was like this:

"Jão Pedro Valênciano"

When it should be like this:

"João Pedro Valenciano"

After some googling, I found some dead end alleys and a good answer. On the SQL TEAM forum there is thread on the subject, proposing a function that solves the issue.

Bellow is the function and the link:

CREATE FUNCTION dbo.Utf8_to_utf16 (@s VARCHAR(8000)) 
returns NVARCHAR(4000) 
  BEGIN 
      IF @s IS NULL 
        RETURN NULL 

      DECLARE @n      INT, 
              @r      NVARCHAR(4000), 
              @cn     INT, 
              @octets INT, 
              @ch     NVARCHAR(2) 

      SET @r = N'' 

      WHILE 1 = 1 
        BEGIN 
            -- dubious: unexpected octets (0x80-0xBF, 0xF8-0xFF) are treated like 0x00-0x7F 
            SET @n = Patindex('%[ร€-รท]%', @s COLLATE latin1_general_bin 
                     ) 

            IF @n = 0 
              BEGIN 
                  SET @r = @r + @s 

                  BREAK 
              END 
            ELSE 
              BEGIN 
                  SET @r = @r + Substring(@s, 1, @n-1) 
                  SET @cn = Ascii(Substring(@s, @n, 1)) 

                  IF @cn <= 0xDF 
                    BEGIN 
                        SET @octets = 2 
                        SET @ch = Nchar(( @cn & 0x1F ) * 0x40 + ( Ascii( 
                                        Substring(@s, @n + 1, 1)) 
                                                                  & 
                                                                  0x3F )) 
                    END 
                  ELSE IF @cn <= 0xEF 
                    BEGIN 
                        SET @octets = 3 
                        SET @ch = Nchar(( @cn & 0x0F ) * 0x1000 + ( Ascii( 
                                        Substring(@s, @n + 1, 1) 
                                                                    ) 
                                                                    & 0x3F ) * 
                                                                            0x40 
                                        + 
                                        ( 
                                                  Ascii(Substring(@s, @n + 2, 1) 
                                                  ) 
                                                  & 
                                                  0x3F 
                                        )) 
                    END 
                  ELSE 
                    BEGIN 
                        -- code point in a supplementary plane: output UTF-16 surrogate pair 
                        SET @octets = 4 
                        SET @ch = Nchar((@cn & 0x07) * 0x100 + (Ascii(Substring( 
                                  @s 
                                  , @n 
                                  +1 
                                  , 1) 
                                  ) 
                                  & 
                                  0x3F 
                                  ) * 0x04 + ( 
                                  Ascii(Substring(@s, @n+2, 1)) & 0x30) / 0x10 + 
                                  0xD7C0) 
                                  + Nchar((Ascii(Substring(@s, @n+2, 1)) & 0x0F) 
                                  * 
                                  0x40 
                                  + ( 
                                  Ascii( 
                                  Substring(@s, @n+3, 1)) & 0x3F) + 0xDC00) 
                    END 

                  SET @r = @r + @ch 
                  SET @s = Substring(@s, @n + @octets, 8000) 
              END 
        END 

      RETURN @r 
  END 

go 


Link to it: http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=62406

Use it like this:
SELECT dbo.utf8_to_utf16 ('João Pedro Valênciano')

The result is:
João Pedro Valênciano


Check out my new book about R Language http://www.amazon.com/dp/B00SX6WA06

Nenhum comentário:

Postar um comentário

Leave your comment here!