0%

DataBase Review

整理学习到的databse知识。

Storage

Hierachy (speed fast to slow):

​ CPU Register -> Cache -> Main Memory -> Disk -> Tapes

Disk Capacity

​ Disk Capacity = #tracks per surface x (2 x #platter) x #sector per track x byte per sector.

  • Access time = seek time + rotational delay + transfer time + other delay

    • Seek time: time to move the arm to position disk head on the right track.
    • Rotational delay: time to wait for sector to rotate under the disk head.
    • Average rotational delay: time for 1/2 revolution.
    • transfer time: block size / transfer rate.
    Buffer Management

​ Pages in memory is like blocks in hard disk.

Example 1
1
2
3
Given a total revolution of 7200 RPM.
One rotation = 1/(7200/60s)*1000ms = 8.33ms
Average rotational latency = 4.16ms
Example 2
1
2
3
4
5
6
7
8
9
3600 RPM = 60 revolutions/s -> 1 rev = 16.66 ms
128 cylinders = 2^7
10% overhead between blocks
time over useful data = 16.66 x 0.9 = 14.99ms
time over gaps = 16.66 x 0.1 = 1.66ms

T1 = time to read one random block
T1 = seek + rotational delay + transfer time
T1 = 25 + 16.66/2 + 0.117 = 33.45ms

Record Layout

Conventional Indexes

Types:
  1. Dense index: index record appears for every search-key value
  2. Sparse index: contains index records for only some search-key value
  3. Multilevel index: if primary index does not fit in memory, access becomes expensive.
    1. Outer index: a sparse index of primary index.
    2. Inner index: the primary index file
  4. Duplicate keys:
    1. Dense index: only keep record in index for first data record with each search key value.
    2. Sparse index: has key-pointer pairs corresponding to the first search key on each block of the data file.

B+ Tree

Characteristics:

​ full(packed) and balanced(almost uniform height).

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
Consider a DBMS has following characteristics: Pointers:4 bytes. Keys: 4 bytes. Blocks:100 bytes.
Compute the maximum number of records we can index with:
a).2-level B Tree
b).2-level B+ Tree

a).2-level B Tree
Find largest integer value of n such that 4n+4(2n+1)<100. n=8.
Root has 8 keys + 8 record pointers + 9 children pointers
= 8 x 4 + 8 x 4 + 9 x 4 = 100bytes.
Each of 9 children has 12 records pointers
= 12 x (4 + 4) + 4 = 100bytes.
So maximum # of records = 12 x 9 + 8 = 116.

b).2-level B+ Tree
Find largest integer value of n such that 4n+(4n+1)<100. n=12.
Root has 12 keys + 13 children pointers
= 12 x 4 + 13 x 4 = 100 bytes.
Each of 13 children has 12 record pointers
= 12 x (4 + 4) + 4 = 100 bytes.
So maximum # of records = 13 x 12 = 156.

B Tree

Hashing

Characteristics:

$$
Avg\ occupancy = \frac{r}{n\times\gamma}
$$

n: The number of buckets that is currently in use.

r: current number of search keys stored in the buckets.

$\gamma$: block size(# search keys that can be stored in 1 block).

Relational Algebra

Algebra expressions / Translate SQL into relational algebra / Equivlaence.

Query Process
  • parsing: translating the query into a parsing tree.
  • query rewrite: the parse tree is transformed into an expression tree of relational algebra (logical query plan)
  • physical plan generation: translate the logical plan into a physical plan
    • select algorithms to implement each operator
    • choose order of operations.
Physical Query Plan
  • Sequential-Scan
    • Read tuples from the relation R by reading data blocks - one block at a time from disk to memory.
  • Index-Scan
    • The relation R must have an index
    • The index is scanned to find blocks that contain the desired tuples.
    • All the blocks containing desired tuples are then read into the memory - one block at a time.
Clustered and unclustered files/relations/indexes
  • Clustered file: A file that stores records of one relation.
  • Unclustered file: A file that stores records of multiple relations.
  • Clustered index: Index that allows tuples to be read in an order that corresponds to physical order.
  • Relation R is clustered.
Operator I/O cost Explanation
Sequential-Scan B(R) Read all blocks, and there are B(R) blocks.
Index-Scan ≤B(R) Depends on number of values in the index is scanned.
  • Relation R is unclustered.
Operator I/O cost Explanation
Sequential-Scan ~T(R) Assuming next tuple is not found in the current block. We will read 1 block per tuple
Index-Scan ≤T(R) Depends on number of values in the index is scanned.
SQL grammer
1
2
3
4
5
6
7
8
9
10
11
12
<Query> ::= SELECT <SelList>
FROM <From List>
WHERE <Condition>

<SelList> ::= <Attribute> | <Attribute> , <SelList>

<FromList> ::= <Relation> | <Relation> , <FromList>

<Condition> ::= <Condition> AND <Condition> |
<Attribute> IN (<Query>) |
<Attribute> = <Attribute>
<Attribute> LIKE <Pattern>
Example of parse trees
1
2
3
4
SELECT movieTitle
FROM StarsIn, MovieStar
WHERE starName = name
AND birthdate LIKE '%1960'

Example 2 of parse trees
1
2
3
4
5
SELECT movieTitle
FROM StarsIn
WHERE starName IN (SELECT name
FROM MovieStar
WHERE birthdate LIKE '%1960')

example2OfParseTrees

Virtual Relation Pre-Processing
1
2
3
4
CREATE VIEW Paramount_Movies AS
(SELECT title, year
FROM Movies
WHERE StudioName = 'Paramount')

image-20191206021435872

Consider the following query on the virtual table Paramount_Movies:
1
2
3
SELECT	title
FROM Paramount_Movies
WHERE year = 1979

image-20191206021508616

Then, the virtual table is replaced by the corresponding logical query plan

image-20191206021527531

Convert Parse Tree into initial L.Q.P

convertParseTreeIntoInitialLQP

Set vs Bag semantics
  • Set semantics: No duplicate entries.
  • Bag semantics: Allow duplicate entries.
Operators
  • Selection
    • Syntax: σ c (R)
    • Set: { t | t ∈ R AND t fulfills c}
    • Bag: { $t^n$ | $t^n$ ∈ R AND t fulfills c}
  • Renaming
    • Syntax: ρ A (R)
    • Set: { t.A | t∈R}
    • Bag: { $(t.A)^n$ | $t^n$ ∈R}
    • Example: $ρ_c$←a(R)
  • Projection
    • Syntax: π A (R)
    • Set: { t.A | t∈R}
    • Bag: { $(t.A)^n$ | $t^n$ ∈R}
  • Joins
    • Theta, natural, cross-product, outer, anti
    • Cross Product
      • Syntax: R × S
      • Set: { (t,s) | t∈R AND s∈S}
      • Bag: { (t,s)^n×m | t^n ∈R AND s^m ∈S}
    • Join
      • Syntax: $R\bowtie_{c}S$
      • Set: { (t,s) | t∈R AND s∈S AND (t,s) matches c}
      • Bag: { $(t,s)^{n×m}$ |$t^n$ ∈R AND $s^m$ ∈S AND (t,s) matches c}
    • Natural Join
      • Syntax: $R\bowtie S$
      • All combinations of tuples from R and S that match on common attributes
      • A = common attributes of R and S
      • C = exclusive attributes of S
      • Set: {(t,s.C) | t∈R AND s∈S AND t.A=s.A}
      • Bag: {$(t,s.C)^{n×m}$ | $t^n$ ∈R AND $s^m$ ∈S AND t.A=s.A}
    • Left-outer Join
      • Syntax:$R=\bowtie{c} S$
      • Contains all tuples in R join S
      • Also includes every tuple in R that is not joined with a tuple in S, after padding a NULL
      • $R=\bowtie _{a=d}S$
      • left_outer_join_example
    • Right-outer Join
      • Syntax:$R\bowtie =S$
      • s∈S without match, fill R attributes with NULLrigh
      • $R\bowtie =_{a=d}S$
      • right_outer_join_example
    • Full-outer Join
      • Syntax: $R=\bowtie =_{c}S$
      • $R=\bowtie=_{a=d}S$
      • full_outer_join_example
  • Aggregation
    • All operators treat a relation as a bag of tuples.
    • SUM: computes the sum of a column with numerical values.
    • AVG: computes the average of a column with numerical values.
    • MIN and MAX: Natural Order
    • COUNT: computes the number of non-NULL tuples in a column.
    • Grouping and aggregation generally need to be implemented and optimized together
    • Syntax: $G\gamma A(R)$
      • A is list of aggregation functions
      • G is list of group by attributes
    • Build groups of tuples according G and compute the aggregation functions from each group
    • $b\gamma sum(a)(R)$
    • aggregation_example
  • Duplicate removal
    • Syntax: $\delta(R)$: is the relation containing exactly one copy of each tuple in R.
    • $\delta(R)$
    • duplicate_removal
  • Set operations.
  • Union, Intersection and Difference
    • Union
      • Syntax: $R\cup S$
      • unionExample
    • Intersection
      • Syntax: $R\cap S$
      • intersectionExample
    • Set Difference
      • Syntax: $R-S$
      • setExample

Canonical Translation to Relational Algebra

  • FROM clause into joins and crossproducts
  • WHERE clause into selection
  • SELECT clause into projection and renaming
    • If it has aggregation functions use aggregation
    • DISTINCT into duplicate removal
  • GROUP BY clause into aggregation
  • HAVING clause into selection
  • ORDER BY - no counter-part

Set Operations

  • UNION ALL into union
  • UNION duplicate removal over union
  • INTERSECT ALL into intersection
  • INTERSECT add duplicate removal
  • EXCEPT ALL into set difference
  • EXCEPT apply duplicate removal to inputs and then apply set difference

Conversion into Relational Algebra

Example 1
1
SELECT name FROM MovieStar WHERE birthdate LIKE '%1960';

image-20191206021614275

Example 2
1
2
3
SELECT movieTitle
FROM StarsIn, MovieStar
WHERE starName = name AND birthdate LIKE '%1960';

image-20191206021630157

Example 3
1
2
3
SELECT sum(a)
FROM R
GROUP BY b

image-20191206021643102

Example 4
1
2
3
4
5
SELECT dep, headcnt
FROM (SELECT count(*) AS headcnt, dep
FROM Employee
GROUP BY dep)
WHERE headcnt > 100

image-20191206021656629

Example 5
1
SELECT title FROM StarsIn WHERE starName IN (<Query>);

conversion_into_relationExample5

Query Processing and Optimization

Query Optimization

image-20191206022236925

Relational Algebra Optimization
Algebraic Laws:
  • Joins and Products
    • Natural joins and product are both associative and commutative
    • theta-join is commutative but not always associative.
  • Union and Intersect
    • Union and intersection are both associative and commutative
  • Laws for Bags and Sets can differ
    • Holds for sets, but guarentee for bags
  • Select
    • reduces the number of tuples (size of relation)
    • an important general rule in optimization is to push selects as far down the tree as possible
    • commutative
    • must be pushed through both arguments for union:
      • Union: σ p (R ∪ S) = σ p (R) ∪ σ p (S)
    • must be pushed through first arguments, optionally second for difference:
      • Difference: σ p (R - S) = σ p (R) - S
    • For the other operators it is only required that the selection be pushed to one argument.
      • σ p (R × S) = σ p (R) × S, if p contains only attributes from R
    • Derivation
      • image-20191206024632637
    • Amendment to the simple push down the select
      • Use this algebraic law in the reverse order: $\sigma _p(R\bowtie S) = \sigma _p(R)\bowtie S$ to push the σ year=1996 operation up the tree
      • image-20191206025216352
      • Then use the algebraic law in the forward order: $\sigma _p(R\bowtie S) = \sigma _p(R)\bowtie \sigma _p(S)$ push the $\sigma _{year=1996}$ operation down the tree.
      • image-20191206025413264
  • Projection
    • image-20191206025519667
  • Example
    • image-20191206030745130
    • image-20191206030801787
    • image-20191206030815185
    • image-20191206030824074
    • image-20191206030838597
  • Summary
    • Selections
      • push down tree as far as possible
      • if condition is an AND, split and push separately
      • sometimes need to push up before pushing down
    • Projections
      • can be pushed down
      • new ones can be added (but be careful)
    • duplicate eliminations can sometimes be removed (e.g., if one key)
    • Selection/product combinations
      • can sometimes be replaced with join
    • Many transformations lead to “promising” plans