0%

15-445 Project 3 - Query Execution

Bustub的查询

Binder - 产生AST,查询表示为BoundSelect
Planner - 产生query plan,树
Optimizer - 优化查询

EXPLAIN计划节点后的是输出架构

1
Projection { exprs=[#0.0, #0.1] } | (__mock_table_1.colA:INTEGER, __mock_table_1.colB:INTEGER)

产生两列,整数类型

Bustub使用iterator模型,每个executer实现Next函数,这个函数返回1.单个tuple 2.null标识没有更多的tuple。每个executer实现一个循环,在循环中调用子executer的Next来获得tuple并处理。
每个executer出了返回tuple还有RID作为tuple的标识符。

Task #1 - Access Method Executors

SeqScan

Hint: Make sure that you understand the difference between the pre-increment and post-increment operators when using the TableIterator object. You might get strange output if you confuse ++iter with iter++. (Check here for a quick refresher.)

实现扫描table需要用到TableIterator,可以用TableHeap::MakeIterator()来构造:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
void SeqScanExecutor::Init() {
auto *catalog = exec_ctx_->GetCatalog();
auto *table_info = catalog->GetTable(plan_->GetTableOid());
auto &table_heap = table_info->table_;
it_ = std::make_unique<TableIterator>(table_heap->MakeIterator());
}

auto SeqScanExecutor::Next(Tuple *tuple, RID *rid) -> bool {
if (it_->IsEnd()) {
return false;
}
while (true) {
auto res = it_->GetTuple();
++(*it_);
if (res.first.is_deleted_) {
continue;
}
*tuple = res.second;
*rid = res.second.GetRid();
break;
}
return true;
}

Insert

Hint: See the Index Updates section below for further details about updating a table’s indexes.

在Insert中,可以使用TableHeap::InsertTuple,但是根据Hint,我们需要同时更新对应的TableIndex,一个table可能包括多个index
Insert的内容就从child_plan中获取

Update

Hint: To implement an update, first delete the affected tuple and then insert a new tuple. Do not use the TableHeap UpdateTupleInplaceUnsafe function unless you are implementing leaderboard optimizations for project 4.

Update需要先删除然后添加

  • 删除是GetTupleMeta后再Update,同时要删除Index
  • 添加同样要添加Index

Delete

delete是一半的Update,设置TupleMeta即可(也要删除Index)

IndexScan

Bustub中,select * from t1 order by v1;如果已经有对v1的索引,会直接优化成为IndexScan,在现在的实现中就是B+树的Scan,这里发现了之前留下的两个坑

  • IndexIterator的是一个trivial copy的类(直接保存了Page*而不是PageGuard),复制直接用了memcpy,但是复制后需要析构的对象释放了对应page的read lock,导致lock和unlock没有对应
  • BPlusTree::Remove的时候,如果只剩一个root节点,并且root节点为空,需要删除掉这个空节点,否则与Iterator配合使用时,无法正确判断Iterator是否到End(因为Page*并不是nullptr,!=tree.End())

Task #2 - Aggregation & Join Executors

Task 2中需要实现Aggregation和Join

Aggregation

SimpleAggregationHashTable 维护一张 hashmap,键为 AggregateKey,值为 AggregateValue,均为 std::vector<Value>。key 代表 group by 的字段的数组,value 则是需要 aggregate 的字段的数组。在下层算子传来一个 tuple 时,将 tuple 的 group by 字段和 aggregate 字段分别提取出来,调用 InsertCombine() 将 group by 和 aggregate 的映射关系存入 SimpleAggregationHashTable。若当前 hashmap 中没有 group by 的记录,则创建初值;若已有记录,则按 aggregate 规则逐一更新所有的 aggregate 字段,例如取 max/min,求 sum 等等。例如下面这条 sql:

1
SELECT min(t.z), max(t.z), sum(t.z) FROM t GROUP BY t.x, t.y;

group by(AggregateKey)为 {t.x, t.y},aggregate(AggregateValue)为 {t.z, t.z, t.z}。aggregate 规则为 {min, max, sum}

Next()中需要返回一个个完成聚合操作后的tuple,所以需要在Init时遍历完child_executer

1
2
3
4
5
6
7
8
9
10
EXPLAIN SELECT MIN(colB), colA FROM __mock_table_1 GROUP BY colA;
=== OPTIMIZER ===
Projection { exprs=[#0.1, #0.0] } | (<unnamed>:INTEGER, __mock_table_1.colA:INTEGER)
Agg { types=[min], aggregates=[#0.1], group_by=[#0.0] } | (__mock_table_1.colA:INTEGER, agg#0:INTEGER)
MockScan { table=__mock_table_1 } | (__mock_table_1.colA:INTEGER, __mock_table_1.colB:INTEGER)

EXPLAIN SELECT colA, MIN(colB) FROM __mock_table_1 GROUP BY colA;
=== OPTIMIZER ===
Agg { types=[min], aggregates=[#0.1], group_by=[#0.0] } | (__mock_table_1.colA:INTEGER, <unnamed>:INTEGER)
MockScan { table=__mock_table_1 } | (__mock_table_1.colA:INTEGER, __mock_table_1.colB:INTEGER)

Next的返回值中总是key在前边,aggregate value在后边

NestedLoopJoin

nlj中,在Init中取出来所有的right_tuples,然后在Next中匹配
需要注意在Next中怎么处理来保存当前的状态,类似于python generator

HashJoin

当join的条件都是相等时,可以使用hash table而不是两次遍历

Optimizing NestedLoopJoin to HashJoin

新增一个optimizer规则,这个可以从nlj_as_hash_join.cpp

1
select * from A inner join B where B.c1 == A.c2;

这句话中,LeftPlan是A的SeqScanRightPlan是B的SeqScan
expression中左儿子是B.c1,右儿子是A.c2,根据TupleIdx区分,在转换的时候需要注意

Task #3 - Sort + Limit Executors and Top-N Optimization

Sort

sort按照多个关键词排序,这里有点绕多写了几行

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
std::sort(sorted_.begin(), sorted_.end(),
[this](const decltype(sorted_)::value_type &lhs, const decltype(sorted_)::value_type &rhs) {
const auto &l_key = lhs.first;
const auto &r_key = rhs.first;
BUSTUB_ASSERT(l_key.keys_.size() == r_key.keys_.size(), "key size not match");
for (uint32_t i = 0; i < l_key.keys_.size(); i++) {
const auto &l = l_key.keys_[i];
const auto &r = r_key.keys_[i];
auto t = this->plan_->order_bys_[i].first;
bool less = l.CompareLessThan(r) == CmpBool::CmpTrue;
bool greater = l.CompareGreaterThan(r) == CmpBool::CmpTrue;
bool desc = t == OrderByType::DESC;
bool asc = t == OrderByType::ASC || t == OrderByType::DEFAULT;
if ((less && desc) || (greater && asc)) {
return false;
}
if ((greater && desc) || (less && asc)) {
return true;
}
}
return true;
});

TopN

用优先队列实现TopN,反转下比较函数

Optimization

之后补充

ADDITIONAL INFORMATION

这一个实验中的Additional Information中详细介绍了我们可以使用的工具

System Catalog

catalog用来维护数据库的元数据,GetTable() GetIndex()可以获得表和索引
catalog中保存了数据库中table和index的映射:一个table可以包含多个index