no zuo no die系列,来自于pg的wiki。
这是第四部分,不要使用NOT IN。
理由是:
1.NOT IN behaves in unexpected ways if there is a null present
2.Because of point 1 above, NOT IN (SELECT …) does not optimize very well. In particular, the planner can’t transform it into an anti-join, and so it becomes either a hashed Subplan or a plain Subplan. The hashed subplan is fast, but the planner only allows that plan for small result sets; the plain subplan is horrifically slow (in fact O(N²)). This means that the performance can look good in small-scale tests but then slow down by 5 or more orders of magnitude once a size threshold is crossed; you do not want this to happen.
第一个理由是如存在null的话,会有未如预期的结果。
[local]:5432 pg12@testdb=# select * from tbl;
id | value
----+-------
1 | 2
(1 row)
Time: 1.359 ms
[local]:5432 pg12@testdb=# select * from tbl where id not in (1,null);
id | value
----+-------
(0 rows)
Time: 1.443 ms
原因是任意值跟null的运算结果都是null,不是false也不是true,因此没有结果返回。
[local]:5432 pg12@testdb=# create table t_null(id int);
CREATE TABLE
Time: 168.932 ms
[local]:5432 pg12@testdb=# insert into t_null values(1);
INSERT 0 1
Time: 6.136 ms
[local]:5432 pg12@testdb=# insert into t_null values(null);
INSERT 0 1
Time: 2.611 ms
[local]:5432 pg12@testdb=# select * from tbl a where a.id not in (select b.id from t_null b);
id | value
----+-------
(0 rows)
Time: 6.682 ms
[local]:5432 pg12@testdb=#
第二个理由是在结果集很大的情况下会有性能问题(实际上,o(  ̄︶ ̄)o)
[local]:5432 pg12@testdb=# explain verbose select * from tbl a where a.id not in (select b.id from t_big_null b);
QUERY PLAN
---------------------------------------------------------------------------------------------
Seq Scan on public.tbl a (cost=0.00..291892417.45 rows=1130 width=8)
Output: a.id, a.value
Filter: (NOT (SubPlan 1))
SubPlan 1
-> Materialize (cost=0.00..233311.72 rows=10000048 width=4)
Output: b.id
-> Seq Scan on public.t_big_null b (cost=0.00..144248.48 rows=10000048 width=4)
Output: b.id
(8 rows)
Time: 2.552 ms
[local]:5432 pg12@testdb=# explain verbose select * from tbl a where not exists (select b.id from t_big_null b where a.id = b.id);
QUERY PLAN
-------------------------------------------------------------------------------------------
Hash Anti Join (cost=308312.08..347459.58 rows=1 width=8)
Output: a.id, a.value
Hash Cond: (a.id = b.id)
-> Seq Scan on public.tbl a (cost=0.00..32.60 rows=2260 width=8)
Output: a.id, a.value
-> Hash (cost=144248.48..144248.48 rows=10000048 width=4)
Output: b.id
-> Seq Scan on public.t_big_null b (cost=0.00..144248.48 rows=10000048 width=4)
Output: b.id
(9 rows)
Time: 2.960 ms
实际执行SQL:
[local]:5432 pg12@testdb=# select * from tbl a where a.id not in (select b.id from t_big_null b);
id | value
----+-------
(0 rows)
Time: 2.176 ms
[local]:5432 pg12@testdb=# select * from tbl a where not exists (select b.id from t_big_null b where a.id = b.id);
id | value
----+-------
(0 rows)
Time: 3540.645 ms (00:03.541)
[local]:5432 pg12@testdb=#
执行计划,not exists的成本比not in要低,但实际上,not in的执行要比not exists要快很多!
参考资料
Don’t Do This