15-445 Project 3 - Query Execution

Bustub的查询

Binder - 产生AST，查询表示为BoundSelect
Planner - 产生query plan，树
Optimizer - 优化查询

EXPLAIN计划节点后的是输出架构

1	Projection { exprs=[#0.0, #0.1] } \| (__mock_table_1.colA:INTEGER, __mock_table_1.colB:INTEGER)

产生两列，整数类型

Bustub使用iterator模型，每个executer实现Next函数，这个函数返回1.单个tuple 2.null标识没有更多的tuple。每个executer实现一个循环，在循环中调用子executer的Next来获得tuple并处理。
每个executer出了返回tuple还有RID作为tuple的标识符。

Task #1 - Access Method Executors

SeqScan

Hint: Make sure that you understand the difference between the pre-increment and post-increment operators when using the TableIterator object. You might get strange output if you confuse ++iter with iter++. (Check here for a quick refresher.)

实现扫描table需要用到TableIterator，可以用TableHeap::MakeIterator()来构造：

void SeqScanExecutor::Init() {
  auto *catalog = exec_ctx_->GetCatalog();
  auto *table_info = catalog->GetTable(plan_->GetTableOid());
  auto &table_heap = table_info->table_;
  it_ = std::make_unique<TableIterator>(table_heap->MakeIterator());
}

auto SeqScanExecutor::Next(Tuple *tuple, RID *rid) -> bool {
  if (it_->IsEnd()) {
    return false;
  }
  while (true) {
    auto res = it_->GetTuple();
    ++(*it_);
    if (res.first.is_deleted_) {
      continue;
    }
    *tuple = res.second;
    *rid = res.second.GetRid();
    break;
  }
  return true;
}

Insert

Hint: See the Index Updates section below for further details about updating a table’s indexes.

在Insert中，可以使用TableHeap::InsertTuple，但是根据Hint，我们需要同时更新对应的TableIndex，一个table可能包括多个index
Insert的内容就从child_plan中获取

Update

Hint: To implement an update, first delete the affected tuple and then insert a new tuple. Do not use the TableHeap UpdateTupleInplaceUnsafe function unless you are implementing leaderboard optimizations for project 4.

Update需要先删除然后添加

删除是GetTupleMeta后再Update，同时要删除Index
添加同样要添加Index

Delete

delete是一半的Update，设置TupleMeta即可（也要删除Index）

IndexScan

Bustub中，select * from t1 order by v1;如果已经有对v1的索引，会直接优化成为IndexScan，在现在的实现中就是B+树的Scan，这里发现了之前留下的两个坑

IndexIterator的是一个trivial copy的类（直接保存了Page*而不是PageGuard），复制直接用了memcpy，但是复制后需要析构的对象释放了对应page的read lock，导致lock和unlock没有对应
BPlusTree::Remove的时候，如果只剩一个root节点，并且root节点为空，需要删除掉这个空节点，否则与Iterator配合使用时，无法正确判断Iterator是否到End（因为Page*并不是nullptr，!=tree.End()）

Task #2 - Aggregation & Join Executors

Task 2中需要实现Aggregation和Join

Aggregation

SimpleAggregationHashTable 维护一张 hashmap，键为 AggregateKey，值为 AggregateValue，均为 std::vector<Value>。key 代表 group by 的字段的数组，value 则是需要 aggregate 的字段的数组。在下层算子传来一个 tuple 时，将 tuple 的 group by 字段和 aggregate 字段分别提取出来，调用 InsertCombine() 将 group by 和 aggregate 的映射关系存入 SimpleAggregationHashTable。若当前 hashmap 中没有 group by 的记录，则创建初值；若已有记录，则按 aggregate 规则逐一更新所有的 aggregate 字段，例如取 max/min，求 sum 等等。例如下面这条 sql：

1	SELECT min(t.z), max(t.z), sum(t.z) FROM t GROUP BY t.x, t.y;

group by（AggregateKey）为 {t.x, t.y}，aggregate（AggregateValue）为 {t.z, t.z, t.z}。aggregate 规则为 {min, max, sum}。

在Next()中需要返回一个个完成聚合操作后的tuple，所以需要在Init时遍历完child_executer

EXPLAIN SELECT MIN(colB), colA FROM __mock_table_1 GROUP BY colA;
 === OPTIMIZER ===                     
 Projection { exprs=[#0.1, #0.0] } | (<unnamed>:INTEGER, __mock_table_1.colA:INTEGER)                     
   Agg { types=[min], aggregates=[#0.1], group_by=[#0.0] } | (__mock_table_1.colA:INTEGER, agg#0:INTEGER) 
     MockScan { table=__mock_table_1 } | (__mock_table_1.colA:INTEGER, __mock_table_1.colB:INTEGER) 

EXPLAIN SELECT colA, MIN(colB) FROM __mock_table_1 GROUP BY colA;
 === OPTIMIZER ===                                                                                          
 Agg { types=[min], aggregates=[#0.1], group_by=[#0.0] } | (__mock_table_1.colA:INTEGER, <unnamed>:INTEGER) 
   MockScan { table=__mock_table_1 } | (__mock_table_1.colA:INTEGER, __mock_table_1.colB:INTEGER)

Next的返回值中总是key在前边，aggregate value在后边

NestedLoopJoin

nlj中，在Init中取出来所有的right_tuples，然后在Next中匹配
需要注意在Next中怎么处理来保存当前的状态，类似于python generator

HashJoin

当join的条件都是相等时，可以使用hash table而不是两次遍历

Optimizing NestedLoopJoin to HashJoin

新增一个optimizer规则，这个可以从nlj_as_hash_join.cpp抄

1	select * from A inner join B where B.c1 == A.c2;

这句话中，LeftPlan是A的SeqScan，RightPlan是B的SeqScan
expression中左儿子是B.c1，右儿子是A.c2，根据TupleIdx区分，在转换的时候需要注意

Task #3 - Sort + Limit Executors and Top-N Optimization

Sort

sort按照多个关键词排序，这里有点绕多写了几行

std::sort(sorted_.begin(), sorted_.end(),
          [this](const decltype(sorted_)::value_type &lhs, const decltype(sorted_)::value_type &rhs) {
            const auto &l_key = lhs.first;
            const auto &r_key = rhs.first;
            BUSTUB_ASSERT(l_key.keys_.size() == r_key.keys_.size(), "key size not match");
            for (uint32_t i = 0; i < l_key.keys_.size(); i++) {
              const auto &l = l_key.keys_[i];
              const auto &r = r_key.keys_[i];
              auto t = this->plan_->order_bys_[i].first;
              bool less = l.CompareLessThan(r) == CmpBool::CmpTrue;
              bool greater = l.CompareGreaterThan(r) == CmpBool::CmpTrue;
              bool desc = t == OrderByType::DESC;
              bool asc = t == OrderByType::ASC || t == OrderByType::DEFAULT;
              if ((less && desc) || (greater && asc)) {
                return false;
              }
              if ((greater && desc) || (less && asc)) {
                return true;
              }
            }
            return true;
          });

TopN

用优先队列实现TopN，反转下比较函数

Optimization

之后补充

ADDITIONAL INFORMATION

这一个实验中的Additional Information中详细介绍了我们可以使用的工具

System Catalog

catalog用来维护数据库的元数据，GetTable() GetIndex()可以获得表和索引
catalog中保存了数据库中table和index的映射：一个table可以包含多个index