PostgreSQL - UUID vs TEXT

In PostgreSQL one can represent UUID as UUID, TEXT or VARCHAR built-in types.

Which type represents a UUID more efficiently? Mainly in the JSONB context?

My gut feeling says built-in type UUID is way better.

One trap with using TEXT is trying to compare different cases with equals:

select 'ef9f94da-98ef-49fa-8224-32f3e1f592b3' = 'EF9F94DA-98EF-49FA-8224-32F3E1F592B3' as equal;
 equal
-------
 f

Checking the relevant RFC4122 section:

Each field is treated as an integer and has its value printed as a
zero-filled hexadecimal digit string with the most significant
digit first.  The hexadecimal values "a" through "f" are output as
lower case characters and are case insensitive on input.

Which the UUID type correctly handles:

select 'ef9f94da-98ef-49fa-8224-32f3e1f592b3'::uuid = 'EF9F94DA-98EF-49FA-8224-32F3E1F592B3'::uuid as equal;
 equal
-------
 t

Looking at type size comparison:

select
    pg_column_size('00000000-0000-0000-0000-000000000000'::UUID) as uuid,
    pg_column_size('00000000-0000-0000-0000-000000000000'::TEXT) as text,
    pg_column_size('00000000-0000-0000-0000-000000000000'::VARCHAR(36)) as char;

 uuid | text | char
------+------+------
   16 |   40 |   40
(1 row)

In PostgreSQL text and varchar are the same: when using varchar(36) we are defining a text column with a check constraint of 36 characters.

Managing UUID as a proper UUID in PostgreSQL seems like a good idea, as PostgreSQL represents it in 128 bits (or 16 bytes).

What does the query planner say?

Generating some data to run comparisons:

drop table if exists a;
drop table if exists b;
drop table if exists c;

create unlogged table if not exists a (
    id   uuid primary key,
    data json
);

create unlogged table if not exists b (
    id   uuid primary key,
    data json
);

create unlogged table if not exists c (
    id   uuid primary key,
    data json
);

with insert_a as (
  insert into a
  select a_id, jsonb_build_object('foo', 'foobar')
  from (select gen_random_uuid() as a_id from generate_series(1, 1e4)) as data
  on conflict do nothing
  returning a.id
),
insert_b as (
  insert into b
  select b_id, jsonb_build_object('foo', 'foobar', 'a', a_id)
  from
  (select gen_random_uuid() as b_id  from generate_series(1, 1e4)) as data,
  (select id as a_id from insert_a) as data_a
  on conflict do nothing
  returning b.id
),
insert_c as (
  insert into c
  select c_id, jsonb_build_object('foo', 'foobar', 'b', b_id)
  from
  (select gen_random_uuid() as c_id from generate_series(1, 1e4)) as data,
  (select id as b_id from insert_b) as data_b
  on conflict do nothing
  returning c.id
)
select count(*) from insert_c;

With this setup the following number of rows were inserted:

select count(*) from a
union all
select count(*) from b
union all
select count(*) from c;

 count
-------
 10000
 10000
 10000

Creating a baseline:

explain analyze select * from c
left join b on b.id::text = c.data->>'b'
left join a on a.id::text = b.data->>'a'
where c.id = '685c304c-2ce0-4e08-8b97-4b32d6dd4e78';

                                                            QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
 Hash Right Join  (cost=273.67..757.66 rows=2500 width=192) (actual time=157.340..299.387 rows=1 loops=1)
   Hash Cond: ((a.id)::text = (b.data ->> 'a'::text))
   ->  Seq Scan on a  (cost=0.00..184.00 rows=10000 width=34) (actual time=0.014..67.927 rows=10000 loops=1)
   ->  Hash  (cost=273.04..273.04 rows=50 width=158) (actual time=157.255..157.312 rows=1 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 9kB
         ->  Hash Right Join  (cost=8.31..273.04 rows=50 width=158) (actual time=0.149..157.268 rows=1 loops=1)
               Hash Cond: ((b.id)::text = (c.data ->> 'b'::text))
               ->  Seq Scan on b  (cost=0.00..234.00 rows=10000 width=79) (actual time=0.025..74.235 rows=10000 loops=1)
               ->  Hash  (cost=8.30..8.30 rows=1 width=79) (actual time=0.091..0.110 rows=1 loops=1)
                     Buckets: 1024  Batches: 1  Memory Usage: 9kB
                     ->  Index Scan using c_pkey on c  (cost=0.29..8.30 rows=1 width=79) (actual time=0.045..0.064 rows=1 loops=1)
                           Index Cond: (id = '685c304c-2ce0-4e08-8b97-4b32d6dd4e78'::uuid)
 Planning Time: 0.175 ms
 Execution Time: 299.458 ms

Setting timing on:

 \timing

select * from c
left join b on b.id::text = c.data->>'b'
left join a on a.id::text = b.data->>'a'
where c.id = '685c304c-2ce0-4e08-8b97-4b32d6dd4e78';

-[ RECORD 1 ]--------------------------------------------------------
id   | 685c304c-2ce0-4e08-8b97-4b32d6dd4e78
data | {"b": "e907b6c8-a41a-47d2-b312-4f118d69c030", "foo": "foobar"}
id   | e907b6c8-a41a-47d2-b312-4f118d69c030
data | {"a": "acfa1f62-c271-4ad4-9daf-818c4921fb0e", "foo": "foobar"}
id   | acfa1f62-c271-4ad4-9daf-818c4921fb0e
data | {"foo": "foobar"}

Time: 8.473 ms

Another interesting query to check is getting all a relations:

explain analyze select count(*) from a
left join b on b.data->>'a' = a.id::text
left join c on c.data->>'b' = b.id::text
where a.id = 'acfa1f62-c271-4ad4-9daf-818c4921fb0e';
                                                            QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=809.91..809.92 rows=1 width=8) (actual time=750.923..751.022 rows=1 loops=1)
   ->  Hash Right Join  (cost=269.67..803.66 rows=2500 width=0) (actual time=320.674..618.406 rows=19999 loops=1)
         Hash Cond: ((c.data ->> 'b'::text) = (b.id)::text)
         ->  Seq Scan on c  (cost=0.00..234.00 rows=10000 width=63) (actual time=0.014..68.143 rows=10000 loops=1)
         ->  Hash  (cost=269.04..269.04 rows=50 width=16) (actual time=320.630..320.686 rows=10000 loops=1)
               Buckets: 16384 (originally 1024)  Batches: 1 (originally 1)  Memory Usage: 597kB
               ->  Hash Right Join  (cost=4.32..269.04 rows=50 width=16) (actual time=0.151..242.404 rows=10000 loops=1)
                     Hash Cond: ((b.data ->> 'a'::text) = (a.id)::text)
                     ->  Seq Scan on b  (cost=0.00..234.00 rows=10000 width=79) (actual time=0.020..70.547 rows=10000 loops=1)
                     ->  Hash  (cost=4.30..4.30 rows=1 width=16) (actual time=0.067..0.086 rows=1 loops=1)
                           Buckets: 1024  Batches: 1  Memory Usage: 9kB
                           ->  Index Only Scan using a_pkey on a  (cost=0.29..4.30 rows=1 width=16) (actual time=0.023..0.040 rows=1 loops=1)
                                 Index Cond: (id = 'acfa1f62-c271-4ad4-9daf-818c4921fb0e'::uuid)
                                 Heap Fetches: 0
 Planning Time: 0.169 ms
 Execution Time: 751.099 ms
(16 rows)

Time: 752.163 ms

select count(*) from a
left join b on b.data->>'a' = a.id::text
left join c on c.data->>'b' = b.id::text
where a.id = 'acfa1f62-c271-4ad4-9daf-818c4921fb0e';
-[ RECORD 1 ]
count | 19999

Time: 53.901 ms

Text Index

If we create an index on data as text:

create index if not exists b_a_text on b(((data->>'a')));
create index if not exists c_b_text on c(((data->>'b')));
vacuum analyze;

explain analyse select * from c
left join b on b.id::text = c.data->>'b'
left join a on a.id::text = b.data->>'a'
where c.id = '685c304c-2ce0-4e08-8b97-4b32d6dd4e78';

                                                         QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Hash Left Join  (cost=317.31..857.16 rows=2500 width=192) (actual time=158.851..300.858 rows=1 loops=1)
   Hash Cond: ((b.data ->> 'a'::text) = (a.id)::text)
   ->  Hash Right Join  (cost=8.31..273.04 rows=50 width=158) (actual time=0.127..142.084 rows=1 loops=1)
         Hash Cond: ((b.id)::text = (c.data ->> 'b'::text))
         ->  Seq Scan on b  (cost=0.00..234.00 rows=10000 width=79) (actual time=0.012..67.688 rows=10000 loops=1)
         ->  Hash  (cost=8.30..8.30 rows=1 width=79) (actual time=0.071..0.090 rows=1 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 9kB
               ->  Index Scan using c_pkey on c  (cost=0.29..8.30 rows=1 width=79) (actual time=0.023..0.041 rows=1 loops=1)
                     Index Cond: (id = '685c304c-2ce0-4e08-8b97-4b32d6dd4e78'::uuid)
   ->  Hash  (cost=184.00..184.00 rows=10000 width=34) (actual time=158.673..158.692 rows=10000 loops=1)
         Buckets: 16384  Batches: 1  Memory Usage: 773kB
         ->  Seq Scan on a  (cost=0.00..184.00 rows=10000 width=34) (actual time=0.012..75.828 rows=10000 loops=1)
 Planning Time: 0.302 ms
 Execution Time: 301.021 ms

select * from c
left join b on b.id::text = c.data->>'b'
left join a on a.id::text = b.data->>'a'
where c.id = '685c304c-2ce0-4e08-8b97-4b32d6dd4e78';

-[ RECORD 1 ]--------------------------------------------------------
id   | 685c304c-2ce0-4e08-8b97-4b32d6dd4e78
data | {"b": "e907b6c8-a41a-47d2-b312-4f118d69c030", "foo": "foobar"}
id   | e907b6c8-a41a-47d2-b312-4f118d69c030
data | {"a": "acfa1f62-c271-4ad4-9daf-818c4921fb0e", "foo": "foobar"}
id   | acfa1f62-c271-4ad4-9daf-818c4921fb0e
data | {"foo": "foobar"}

Time: 8.894 ms

Getting all a relations:

explain analyze select count(*) from a
left join b on b.data->>'a' = a.id::text
left join c on c.data->>'b' = b.id::text
where a.id = 'acfa1f62-c271-4ad4-9daf-818c4921fb0e';

                                                                  QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=764.24..764.25 rows=1 width=8) (actual time=1038.579..1038.680 rows=1 loops=1)
   ->  Merge Right Join  (cost=270.74..757.99 rows=2500 width=0) (actual time=333.769..903.920 rows=19999 loops=1)
         Merge Cond: ((c.data ->> 'b'::text) = ((b.id)::text))
         ->  Index Scan using c_b_text on c  (cost=0.29..331.29 rows=10000 width=63) (actual time=0.029..74.342 rows=10000 loops=1)
         ->  Sort  (cost=270.45..270.58 rows=50 width=16) (actual time=333.589..467.134 rows=19999 loops=1)
               Sort Key: ((b.id)::text)
               Sort Method: quicksort  Memory: 1791kB
               ->  Hash Right Join  (cost=4.32..269.04 rows=50 width=16) (actual time=0.148..245.843 rows=10000 loops=1)
                     Hash Cond: ((b.data ->> 'a'::text) = (a.id)::text)
                     ->  Seq Scan on b  (cost=0.00..234.00 rows=10000 width=79) (actual time=0.023..70.893 rows=10000 loops=1)
                     ->  Hash  (cost=4.30..4.30 rows=1 width=16) (actual time=0.077..0.097 rows=1 loops=1)
                           Buckets: 1024  Batches: 1  Memory Usage: 9kB
                           ->  Index Only Scan using a_pkey on a  (cost=0.29..4.30 rows=1 width=16) (actual time=0.022..0.043 rows=1 loops=1)
                                 Index Cond: (id = 'acfa1f62-c271-4ad4-9daf-818c4921fb0e'::uuid)
                                 Heap Fetches: 0
 Planning Time: 0.240 ms
 Execution Time: 1038.838 ms
(17 rows)

Time: 1039.818 ms (00:01.040)

select count(*) from a
left join b on b.data->>'a' = a.id::text
left join c on c.data->>'b' = b.id::text
where a.id = 'acfa1f62-c271-4ad4-9daf-818c4921fb0e';
-[ RECORD 1 ]
count | 19999

Time: 60.985 ms

In this case it doesn’t seem like the index is even useful, as we are getting a value similiar to the baseline.

UUID Index

If we create an index on data as uuid:

create index if not exists b_a_uuid on b(((data->>'a')::uuid));
create index if not exists c_b_uuid on c(((data->>'b')::uuid));
vacuum analyze;

explain analyze select * from c
left join b on b.id = (c.data->>'b')::uuid
left join a on a.id = (b.data->>'a')::uuid
where c.id = '685c304c-2ce0-4e08-8b97-4b32d6dd4e78';
                                                      QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 Nested Loop Left Join  (cost=0.87..16.98 rows=1 width=192) (actual time=0.146..0.293 rows=1 loops=1)
   ->  Nested Loop Left Join  (cost=0.58..16.61 rows=1 width=158) (actual time=0.094..0.177 rows=1 loops=1)
         ->  Index Scan using c_pkey on c  (cost=0.29..8.30 rows=1 width=79) (actual time=0.023..0.045 rows=1 loops=1)
               Index Cond: (id = '685c304c-2ce0-4e08-8b97-4b32d6dd4e78'::uuid)
         ->  Index Scan using b_pkey on b  (cost=0.29..8.31 rows=1 width=79) (actual time=0.026..0.037 rows=1 loops=1)
               Index Cond: (id = ((c.data ->> 'b'::text))::uuid)
   ->  Index Scan using a_pkey on a  (cost=0.29..0.37 rows=1 width=34) (actual time=0.020..0.032 rows=1 loops=1)
         Index Cond: (id = ((b.data ->> 'a'::text))::uuid)
 Planning Time: 0.258 ms
 Execution Time: 0.366 ms
(10 rows)

select * from c
left join b on b.id = (c.data->>'b')::uuid
left join a on a.id = (b.data->>'a')::uuid
where c.id = '685c304c-2ce0-4e08-8b97-4b32d6dd4e78';

-[ RECORD 1 ]--------------------------------------------------------
id   | 685c304c-2ce0-4e08-8b97-4b32d6dd4e78
data | {"b": "e907b6c8-a41a-47d2-b312-4f118d69c030", "foo": "foobar"}
id   | e907b6c8-a41a-47d2-b312-4f118d69c030
data | {"a": "acfa1f62-c271-4ad4-9daf-818c4921fb0e", "foo": "foobar"}
id   | acfa1f62-c271-4ad4-9daf-818c4921fb0e
data | {"foo": "foobar"}

Time: 1.044 ms

Getting all a relations:

explain analyze select count(*) from a
left join b on (b.data->>'a')::uuid = a.id
left join c on (c.data->>'b')::uuid = b.id
where a.id = 'acfa1f62-c271-4ad4-9daf-818c4921fb0e';

                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=927.33..927.34 rows=1 width=8) (actual time=974.052..974.152 rows=1 loops=1)
   ->  Hash Right Join  (cost=463.31..902.33 rows=10000 width=0) (actual time=179.660..836.708 rows=19999 loops=1)
         Hash Cond: (((b.data ->> 'a'::text))::uuid = a.id)
         ->  Hash Right Join  (cost=459.00..723.01 rows=10000 width=63) (actual time=179.555..492.140 rows=19999 loops=1)
               Hash Cond: (((c.data ->> 'b'::text))::uuid = b.id)
               ->  Seq Scan on c  (cost=0.00..234.00 rows=10000 width=63) (actual time=0.014..68.855 rows=10000 loops=1)
               ->  Hash  (cost=334.00..334.00 rows=10000 width=79) (actual time=179.496..179.514 rows=10000 loops=1)
                     Buckets: 16384  Batches: 1  Memory Usage: 1212kB
                     ->  Seq Scan on b  (cost=0.00..334.00 rows=10000 width=79) (actual time=0.034..100.641 rows=10000 loops=1)
                           Filter: (((data ->> 'a'::text))::uuid = 'acfa1f62-c271-4ad4-9daf-818c4921fb0e'::uuid)
         ->  Hash  (cost=4.30..4.30 rows=1 width=16) (actual time=0.070..0.089 rows=1 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 9kB
               ->  Index Only Scan using a_pkey on a  (cost=0.29..4.30 rows=1 width=16) (actual time=0.025..0.043 rows=1 loops=1)
                     Index Cond: (id = 'acfa1f62-c271-4ad4-9daf-818c4921fb0e'::uuid)
                     Heap Fetches: 0
 Planning Time: 0.374 ms
 Execution Time: 974.242 ms
(17 rows)

Time: 975.312 ms

In this case the planner is deciding to use a Hash Right Join which is suspicious.

Trying the same query with hashjoin disabled:

set enable_hashjoin=off;

explain analyze select * from a
left join b on (b.data->>'a')::uuid = a.id
left join c on (c.data->>'b')::uuid = b.id
where a.id = 'acfa1f62-c271-4ad4-9daf-818c4921fb0e';
                                                             QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop Left Join  (cost=0.86..1169.87 rows=10000 width=192) (actual time=0.160..777.422 rows=19999 loops=1)
   Join Filter: (((b.data ->> 'a'::text))::uuid = a.id)
   ->  Index Scan using a_pkey on a  (cost=0.29..2.30 rows=1 width=34) (actual time=0.031..0.046 rows=1 loops=1)
         Index Cond: (id = 'acfa1f62-c271-4ad4-9daf-818c4921fb0e'::uuid)
   ->  Merge Left Join  (cost=0.57..967.57 rows=10000 width=158) (actual time=0.089..460.686 rows=19999 loops=1)
         Merge Cond: (b.id = ((c.data ->> 'b'::text))::uuid)
         ->  Index Scan using b_aa_uuid on b  (cost=0.29..372.29 rows=10000 width=79) (actual time=0.024..85.402 rows=10000 loops=1)
               Index Cond: (((data ->> 'a'::text))::uuid = 'acfa1f62-c271-4ad4-9daf-818c4921fb0e'::uuid)
         ->  Index Scan using c_b_uuid on c  (cost=0.29..295.29 rows=10000 width=79) (actual time=0.022..73.415 rows=10000 loops=1)
 Planning Time: 0.262 ms
 Execution Time: 912.839 ms
(11 rows)

select count(*) from a
left join b on (b.data->>'a')::uuid = a.id
left join c on (c.data->>'b')::uuid = b.id
where a.id = 'acfa1f62-c271-4ad4-9daf-818c4921fb0e';
-[ RECORD 1 ]
count | 19999

Time: 57.294 ms

Disabling hashjoin also improves the performance in case where we are using text:

explain analyse select count(*) from a
left join b on b.data->>'a' = a.id::text
left join c on c.data->>'b' = b.id::text
where a.id = 'acfa1f62-c271-4ad4-9daf-818c4921fb0e';
                                                                   QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=856.31..856.32 rows=1 width=8) (actual time=1025.381..1025.522 rows=1 loops=1)
   ->  Merge Right Join  (cost=398.81..850.06 rows=2500 width=0) (actual time=326.140..890.957 rows=19999 loops=1)
         Merge Cond: ((c.data ->> 'b'::text) = ((b.id)::text))
         ->  Index Scan using c_b_text on c  (cost=0.29..295.29 rows=10000 width=63) (actual time=0.036..72.234 rows=10000 loops=1)
         ->  Sort  (cost=398.52..398.65 rows=50 width=16) (actual time=326.058..460.792 rows=19999 loops=1)
               Sort Key: ((b.id)::text)
               Sort Method: quicksort  Memory: 1791kB
               ->  Merge Right Join  (cost=1.60..397.11 rows=50 width=16) (actual time=0.141..239.702 rows=10000 loops=1)
                     Merge Cond: ((b.data ->> 'a'::text) = ((a.id)::text))
                     ->  Index Scan using b_a_text on b  (cost=0.29..295.29 rows=10000 width=79) (actual time=0.024..76.033 rows=10000 loops=1)
                     ->  Sort  (cost=1.31..1.32 rows=1 width=16) (actual time=0.080..0.105 rows=1 loops=1)
                           Sort Key: ((a.id)::text)
                           Sort Method: quicksort  Memory: 25kB
                           ->  Index Only Scan using a_pkey on a  (cost=0.29..1.30 rows=1 width=16) (actual time=0.029..0.048 rows=1 loops=1)
                                 Index Cond: (id = 'acfa1f62-c271-4ad4-9daf-818c4921fb0e'::uuid)
                                 Heap Fetches: 0
 Planning Time: 0.371 ms
 Execution Time: 1025.622 ms
(18 rows)

select count(*) from a
left join b on b.data->>'a' = a.id::text
left join c on c.data->>'b' = b.id::text
where a.id = 'acfa1f62-c271-4ad4-9daf-818c4921fb0e';
 count
-------
 19999
(1 row)

Time: 51.609 ms

Massive improvement with the uuid index. 1.044ms vs 8.894 ms which is way faster and we’re fully utilising the index we just created.

However, when getting all a relations, performance wise it is similar when hashjoin is disabled. The lack of column statistics for JSONB columns may be the culprit here.

Conclusion

Using the built-in type UUID is more efficient at the cost of a slightly more complicated query writing, with the extra benefit of avoiding handling possible pitfalls when comparing UUIDs.