PostgreSQL - UUID vs TEXT
In PostgreSQL one can represent UUID as UUID
, TEXT
or VARCHAR
built-in types.
Which type represents a UUID more efficiently? Mainly in the JSONB context?
My gut feeling says built-in type UUID
is way better.
One trap with using TEXT
is trying to compare different cases with equals:
select 'ef9f94da-98ef-49fa-8224-32f3e1f592b3' = 'EF9F94DA-98EF-49FA-8224-32F3E1F592B3' as equal;
equal
-------
f
Checking the relevant RFC4122 section:
Each field is treated as an integer and has its value printed as a
zero-filled hexadecimal digit string with the most significant
digit first. The hexadecimal values "a" through "f" are output as
lower case characters and are case insensitive on input.
Which the UUID
type correctly handles:
select 'ef9f94da-98ef-49fa-8224-32f3e1f592b3'::uuid = 'EF9F94DA-98EF-49FA-8224-32F3E1F592B3'::uuid as equal;
equal
-------
t
Looking at type size comparison:
select
pg_column_size('00000000-0000-0000-0000-000000000000'::UUID) as uuid,
pg_column_size('00000000-0000-0000-0000-000000000000'::TEXT) as text,
pg_column_size('00000000-0000-0000-0000-000000000000'::VARCHAR(36)) as char;
uuid | text | char
------+------+------
16 | 40 | 40
(1 row)
In PostgreSQL text
and varchar
are the same: when using varchar(36)
we are defining a text
column with a check constraint of 36 characters.
Managing UUID
as a proper UUID
in PostgreSQL seems like a good idea, as PostgreSQL represents it in 128 bits (or 16 bytes).
What does the query planner say?
Generating some data to run comparisons:
drop table if exists a;
drop table if exists b;
drop table if exists c;
create unlogged table if not exists a (
id uuid primary key,
data json
);
create unlogged table if not exists b (
id uuid primary key,
data json
);
create unlogged table if not exists c (
id uuid primary key,
data json
);
with insert_a as (
insert into a
select a_id, jsonb_build_object('foo', 'foobar')
from (select gen_random_uuid() as a_id from generate_series(1, 1e4)) as data
on conflict do nothing
returning a.id
),
insert_b as (
insert into b
select b_id, jsonb_build_object('foo', 'foobar', 'a', a_id)
from
(select gen_random_uuid() as b_id from generate_series(1, 1e4)) as data,
(select id as a_id from insert_a) as data_a
on conflict do nothing
returning b.id
),
insert_c as (
insert into c
select c_id, jsonb_build_object('foo', 'foobar', 'b', b_id)
from
(select gen_random_uuid() as c_id from generate_series(1, 1e4)) as data,
(select id as b_id from insert_b) as data_b
on conflict do nothing
returning c.id
)
select count(*) from insert_c;
With this setup the following number of rows were inserted:
select count(*) from a
union all
select count(*) from b
union all
select count(*) from c;
count
-------
10000
10000
10000
Creating a baseline:
explain analyze select * from c
left join b on b.id::text = c.data->>'b'
left join a on a.id::text = b.data->>'a'
where c.id = '685c304c-2ce0-4e08-8b97-4b32d6dd4e78';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
Hash Right Join (cost=273.67..757.66 rows=2500 width=192) (actual time=157.340..299.387 rows=1 loops=1)
Hash Cond: ((a.id)::text = (b.data ->> 'a'::text))
-> Seq Scan on a (cost=0.00..184.00 rows=10000 width=34) (actual time=0.014..67.927 rows=10000 loops=1)
-> Hash (cost=273.04..273.04 rows=50 width=158) (actual time=157.255..157.312 rows=1 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> Hash Right Join (cost=8.31..273.04 rows=50 width=158) (actual time=0.149..157.268 rows=1 loops=1)
Hash Cond: ((b.id)::text = (c.data ->> 'b'::text))
-> Seq Scan on b (cost=0.00..234.00 rows=10000 width=79) (actual time=0.025..74.235 rows=10000 loops=1)
-> Hash (cost=8.30..8.30 rows=1 width=79) (actual time=0.091..0.110 rows=1 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> Index Scan using c_pkey on c (cost=0.29..8.30 rows=1 width=79) (actual time=0.045..0.064 rows=1 loops=1)
Index Cond: (id = '685c304c-2ce0-4e08-8b97-4b32d6dd4e78'::uuid)
Planning Time: 0.175 ms
Execution Time: 299.458 ms
Setting timing on:
\timing
select * from c
left join b on b.id::text = c.data->>'b'
left join a on a.id::text = b.data->>'a'
where c.id = '685c304c-2ce0-4e08-8b97-4b32d6dd4e78';
-[ RECORD 1 ]--------------------------------------------------------
id | 685c304c-2ce0-4e08-8b97-4b32d6dd4e78
data | {"b": "e907b6c8-a41a-47d2-b312-4f118d69c030", "foo": "foobar"}
id | e907b6c8-a41a-47d2-b312-4f118d69c030
data | {"a": "acfa1f62-c271-4ad4-9daf-818c4921fb0e", "foo": "foobar"}
id | acfa1f62-c271-4ad4-9daf-818c4921fb0e
data | {"foo": "foobar"}
Time: 8.473 ms
Another interesting query to check is getting all a
relations:
explain analyze select count(*) from a
left join b on b.data->>'a' = a.id::text
left join c on c.data->>'b' = b.id::text
where a.id = 'acfa1f62-c271-4ad4-9daf-818c4921fb0e';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=809.91..809.92 rows=1 width=8) (actual time=750.923..751.022 rows=1 loops=1)
-> Hash Right Join (cost=269.67..803.66 rows=2500 width=0) (actual time=320.674..618.406 rows=19999 loops=1)
Hash Cond: ((c.data ->> 'b'::text) = (b.id)::text)
-> Seq Scan on c (cost=0.00..234.00 rows=10000 width=63) (actual time=0.014..68.143 rows=10000 loops=1)
-> Hash (cost=269.04..269.04 rows=50 width=16) (actual time=320.630..320.686 rows=10000 loops=1)
Buckets: 16384 (originally 1024) Batches: 1 (originally 1) Memory Usage: 597kB
-> Hash Right Join (cost=4.32..269.04 rows=50 width=16) (actual time=0.151..242.404 rows=10000 loops=1)
Hash Cond: ((b.data ->> 'a'::text) = (a.id)::text)
-> Seq Scan on b (cost=0.00..234.00 rows=10000 width=79) (actual time=0.020..70.547 rows=10000 loops=1)
-> Hash (cost=4.30..4.30 rows=1 width=16) (actual time=0.067..0.086 rows=1 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> Index Only Scan using a_pkey on a (cost=0.29..4.30 rows=1 width=16) (actual time=0.023..0.040 rows=1 loops=1)
Index Cond: (id = 'acfa1f62-c271-4ad4-9daf-818c4921fb0e'::uuid)
Heap Fetches: 0
Planning Time: 0.169 ms
Execution Time: 751.099 ms
(16 rows)
Time: 752.163 ms
select count(*) from a
left join b on b.data->>'a' = a.id::text
left join c on c.data->>'b' = b.id::text
where a.id = 'acfa1f62-c271-4ad4-9daf-818c4921fb0e';
-[ RECORD 1 ]
count | 19999
Time: 53.901 ms
Text Index
If we create an index on data as text
:
create index if not exists b_a_text on b(((data->>'a')));
create index if not exists c_b_text on c(((data->>'b')));
vacuum analyze;
explain analyse select * from c
left join b on b.id::text = c.data->>'b'
left join a on a.id::text = b.data->>'a'
where c.id = '685c304c-2ce0-4e08-8b97-4b32d6dd4e78';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Hash Left Join (cost=317.31..857.16 rows=2500 width=192) (actual time=158.851..300.858 rows=1 loops=1)
Hash Cond: ((b.data ->> 'a'::text) = (a.id)::text)
-> Hash Right Join (cost=8.31..273.04 rows=50 width=158) (actual time=0.127..142.084 rows=1 loops=1)
Hash Cond: ((b.id)::text = (c.data ->> 'b'::text))
-> Seq Scan on b (cost=0.00..234.00 rows=10000 width=79) (actual time=0.012..67.688 rows=10000 loops=1)
-> Hash (cost=8.30..8.30 rows=1 width=79) (actual time=0.071..0.090 rows=1 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> Index Scan using c_pkey on c (cost=0.29..8.30 rows=1 width=79) (actual time=0.023..0.041 rows=1 loops=1)
Index Cond: (id = '685c304c-2ce0-4e08-8b97-4b32d6dd4e78'::uuid)
-> Hash (cost=184.00..184.00 rows=10000 width=34) (actual time=158.673..158.692 rows=10000 loops=1)
Buckets: 16384 Batches: 1 Memory Usage: 773kB
-> Seq Scan on a (cost=0.00..184.00 rows=10000 width=34) (actual time=0.012..75.828 rows=10000 loops=1)
Planning Time: 0.302 ms
Execution Time: 301.021 ms
select * from c
left join b on b.id::text = c.data->>'b'
left join a on a.id::text = b.data->>'a'
where c.id = '685c304c-2ce0-4e08-8b97-4b32d6dd4e78';
-[ RECORD 1 ]--------------------------------------------------------
id | 685c304c-2ce0-4e08-8b97-4b32d6dd4e78
data | {"b": "e907b6c8-a41a-47d2-b312-4f118d69c030", "foo": "foobar"}
id | e907b6c8-a41a-47d2-b312-4f118d69c030
data | {"a": "acfa1f62-c271-4ad4-9daf-818c4921fb0e", "foo": "foobar"}
id | acfa1f62-c271-4ad4-9daf-818c4921fb0e
data | {"foo": "foobar"}
Time: 8.894 ms
Getting all a
relations:
explain analyze select count(*) from a
left join b on b.data->>'a' = a.id::text
left join c on c.data->>'b' = b.id::text
where a.id = 'acfa1f62-c271-4ad4-9daf-818c4921fb0e';
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=764.24..764.25 rows=1 width=8) (actual time=1038.579..1038.680 rows=1 loops=1)
-> Merge Right Join (cost=270.74..757.99 rows=2500 width=0) (actual time=333.769..903.920 rows=19999 loops=1)
Merge Cond: ((c.data ->> 'b'::text) = ((b.id)::text))
-> Index Scan using c_b_text on c (cost=0.29..331.29 rows=10000 width=63) (actual time=0.029..74.342 rows=10000 loops=1)
-> Sort (cost=270.45..270.58 rows=50 width=16) (actual time=333.589..467.134 rows=19999 loops=1)
Sort Key: ((b.id)::text)
Sort Method: quicksort Memory: 1791kB
-> Hash Right Join (cost=4.32..269.04 rows=50 width=16) (actual time=0.148..245.843 rows=10000 loops=1)
Hash Cond: ((b.data ->> 'a'::text) = (a.id)::text)
-> Seq Scan on b (cost=0.00..234.00 rows=10000 width=79) (actual time=0.023..70.893 rows=10000 loops=1)
-> Hash (cost=4.30..4.30 rows=1 width=16) (actual time=0.077..0.097 rows=1 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> Index Only Scan using a_pkey on a (cost=0.29..4.30 rows=1 width=16) (actual time=0.022..0.043 rows=1 loops=1)
Index Cond: (id = 'acfa1f62-c271-4ad4-9daf-818c4921fb0e'::uuid)
Heap Fetches: 0
Planning Time: 0.240 ms
Execution Time: 1038.838 ms
(17 rows)
Time: 1039.818 ms (00:01.040)
select count(*) from a
left join b on b.data->>'a' = a.id::text
left join c on c.data->>'b' = b.id::text
where a.id = 'acfa1f62-c271-4ad4-9daf-818c4921fb0e';
-[ RECORD 1 ]
count | 19999
Time: 60.985 ms
In this case it doesn’t seem like the index is even useful, as we are getting a value similiar to the baseline.
UUID Index
If we create an index on data as uuid
:
create index if not exists b_a_uuid on b(((data->>'a')::uuid));
create index if not exists c_b_uuid on c(((data->>'b')::uuid));
vacuum analyze;
explain analyze select * from c
left join b on b.id = (c.data->>'b')::uuid
left join a on a.id = (b.data->>'a')::uuid
where c.id = '685c304c-2ce0-4e08-8b97-4b32d6dd4e78';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Nested Loop Left Join (cost=0.87..16.98 rows=1 width=192) (actual time=0.146..0.293 rows=1 loops=1)
-> Nested Loop Left Join (cost=0.58..16.61 rows=1 width=158) (actual time=0.094..0.177 rows=1 loops=1)
-> Index Scan using c_pkey on c (cost=0.29..8.30 rows=1 width=79) (actual time=0.023..0.045 rows=1 loops=1)
Index Cond: (id = '685c304c-2ce0-4e08-8b97-4b32d6dd4e78'::uuid)
-> Index Scan using b_pkey on b (cost=0.29..8.31 rows=1 width=79) (actual time=0.026..0.037 rows=1 loops=1)
Index Cond: (id = ((c.data ->> 'b'::text))::uuid)
-> Index Scan using a_pkey on a (cost=0.29..0.37 rows=1 width=34) (actual time=0.020..0.032 rows=1 loops=1)
Index Cond: (id = ((b.data ->> 'a'::text))::uuid)
Planning Time: 0.258 ms
Execution Time: 0.366 ms
(10 rows)
select * from c
left join b on b.id = (c.data->>'b')::uuid
left join a on a.id = (b.data->>'a')::uuid
where c.id = '685c304c-2ce0-4e08-8b97-4b32d6dd4e78';
-[ RECORD 1 ]--------------------------------------------------------
id | 685c304c-2ce0-4e08-8b97-4b32d6dd4e78
data | {"b": "e907b6c8-a41a-47d2-b312-4f118d69c030", "foo": "foobar"}
id | e907b6c8-a41a-47d2-b312-4f118d69c030
data | {"a": "acfa1f62-c271-4ad4-9daf-818c4921fb0e", "foo": "foobar"}
id | acfa1f62-c271-4ad4-9daf-818c4921fb0e
data | {"foo": "foobar"}
Time: 1.044 ms
Getting all a
relations:
explain analyze select count(*) from a
left join b on (b.data->>'a')::uuid = a.id
left join c on (c.data->>'b')::uuid = b.id
where a.id = 'acfa1f62-c271-4ad4-9daf-818c4921fb0e';
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=927.33..927.34 rows=1 width=8) (actual time=974.052..974.152 rows=1 loops=1)
-> Hash Right Join (cost=463.31..902.33 rows=10000 width=0) (actual time=179.660..836.708 rows=19999 loops=1)
Hash Cond: (((b.data ->> 'a'::text))::uuid = a.id)
-> Hash Right Join (cost=459.00..723.01 rows=10000 width=63) (actual time=179.555..492.140 rows=19999 loops=1)
Hash Cond: (((c.data ->> 'b'::text))::uuid = b.id)
-> Seq Scan on c (cost=0.00..234.00 rows=10000 width=63) (actual time=0.014..68.855 rows=10000 loops=1)
-> Hash (cost=334.00..334.00 rows=10000 width=79) (actual time=179.496..179.514 rows=10000 loops=1)
Buckets: 16384 Batches: 1 Memory Usage: 1212kB
-> Seq Scan on b (cost=0.00..334.00 rows=10000 width=79) (actual time=0.034..100.641 rows=10000 loops=1)
Filter: (((data ->> 'a'::text))::uuid = 'acfa1f62-c271-4ad4-9daf-818c4921fb0e'::uuid)
-> Hash (cost=4.30..4.30 rows=1 width=16) (actual time=0.070..0.089 rows=1 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> Index Only Scan using a_pkey on a (cost=0.29..4.30 rows=1 width=16) (actual time=0.025..0.043 rows=1 loops=1)
Index Cond: (id = 'acfa1f62-c271-4ad4-9daf-818c4921fb0e'::uuid)
Heap Fetches: 0
Planning Time: 0.374 ms
Execution Time: 974.242 ms
(17 rows)
Time: 975.312 ms
In this case the planner is deciding to use a Hash Right Join
which is suspicious.
Trying the same query with hashjoin disabled:
set enable_hashjoin=off;
explain analyze select * from a
left join b on (b.data->>'a')::uuid = a.id
left join c on (c.data->>'b')::uuid = b.id
where a.id = 'acfa1f62-c271-4ad4-9daf-818c4921fb0e';
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
Nested Loop Left Join (cost=0.86..1169.87 rows=10000 width=192) (actual time=0.160..777.422 rows=19999 loops=1)
Join Filter: (((b.data ->> 'a'::text))::uuid = a.id)
-> Index Scan using a_pkey on a (cost=0.29..2.30 rows=1 width=34) (actual time=0.031..0.046 rows=1 loops=1)
Index Cond: (id = 'acfa1f62-c271-4ad4-9daf-818c4921fb0e'::uuid)
-> Merge Left Join (cost=0.57..967.57 rows=10000 width=158) (actual time=0.089..460.686 rows=19999 loops=1)
Merge Cond: (b.id = ((c.data ->> 'b'::text))::uuid)
-> Index Scan using b_aa_uuid on b (cost=0.29..372.29 rows=10000 width=79) (actual time=0.024..85.402 rows=10000 loops=1)
Index Cond: (((data ->> 'a'::text))::uuid = 'acfa1f62-c271-4ad4-9daf-818c4921fb0e'::uuid)
-> Index Scan using c_b_uuid on c (cost=0.29..295.29 rows=10000 width=79) (actual time=0.022..73.415 rows=10000 loops=1)
Planning Time: 0.262 ms
Execution Time: 912.839 ms
(11 rows)
select count(*) from a
left join b on (b.data->>'a')::uuid = a.id
left join c on (c.data->>'b')::uuid = b.id
where a.id = 'acfa1f62-c271-4ad4-9daf-818c4921fb0e';
-[ RECORD 1 ]
count | 19999
Time: 57.294 ms
Disabling hashjoin
also improves the performance in case where we are using text
:
explain analyse select count(*) from a
left join b on b.data->>'a' = a.id::text
left join c on c.data->>'b' = b.id::text
where a.id = 'acfa1f62-c271-4ad4-9daf-818c4921fb0e';
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=856.31..856.32 rows=1 width=8) (actual time=1025.381..1025.522 rows=1 loops=1)
-> Merge Right Join (cost=398.81..850.06 rows=2500 width=0) (actual time=326.140..890.957 rows=19999 loops=1)
Merge Cond: ((c.data ->> 'b'::text) = ((b.id)::text))
-> Index Scan using c_b_text on c (cost=0.29..295.29 rows=10000 width=63) (actual time=0.036..72.234 rows=10000 loops=1)
-> Sort (cost=398.52..398.65 rows=50 width=16) (actual time=326.058..460.792 rows=19999 loops=1)
Sort Key: ((b.id)::text)
Sort Method: quicksort Memory: 1791kB
-> Merge Right Join (cost=1.60..397.11 rows=50 width=16) (actual time=0.141..239.702 rows=10000 loops=1)
Merge Cond: ((b.data ->> 'a'::text) = ((a.id)::text))
-> Index Scan using b_a_text on b (cost=0.29..295.29 rows=10000 width=79) (actual time=0.024..76.033 rows=10000 loops=1)
-> Sort (cost=1.31..1.32 rows=1 width=16) (actual time=0.080..0.105 rows=1 loops=1)
Sort Key: ((a.id)::text)
Sort Method: quicksort Memory: 25kB
-> Index Only Scan using a_pkey on a (cost=0.29..1.30 rows=1 width=16) (actual time=0.029..0.048 rows=1 loops=1)
Index Cond: (id = 'acfa1f62-c271-4ad4-9daf-818c4921fb0e'::uuid)
Heap Fetches: 0
Planning Time: 0.371 ms
Execution Time: 1025.622 ms
(18 rows)
select count(*) from a
left join b on b.data->>'a' = a.id::text
left join c on c.data->>'b' = b.id::text
where a.id = 'acfa1f62-c271-4ad4-9daf-818c4921fb0e';
count
-------
19999
(1 row)
Time: 51.609 ms
Massive improvement with the uuid index. 1.044ms vs 8.894 ms which is way faster and we’re fully utilising the index we just created.
However, when getting all a
relations, performance wise it is similar when hashjoin
is disabled. The lack of column statistics for JSONB columns may be the culprit here.
Conclusion
Using the built-in type UUID
is more efficient at the cost of a slightly more complicated query writing, with the extra benefit of avoiding handling possible pitfalls when comparing UUID
s.