DataBase Review

2019-12-05

Word count: 1.8k | Reading time≈ 10 min

整理学习到的databse知识。

Storage

Hierachy (speed fast to slow):

CPU Register -> Cache -> Main Memory -> Disk -> Tapes

Disk Capacity

Disk Capacity = #tracks per surface x (2 x #platter) x #sector per track x byte per sector.

Access time = seek time + rotational delay + transfer time + other delay
- Seek time: time to move the arm to position disk head on the right track.
- Rotational delay: time to wait for sector to rotate under the disk head.
- Average rotational delay: time for 1/2 revolution.
- transfer time: block size / transfer rate.
Buffer Management

Pages in memory is like blocks in hard disk.

Example 1

1
2
3

Given a total revolution of 7200 RPM.
One rotation = 1/(7200/60s)*1000ms = 8.33ms
Average rotational latency = 4.16ms

Example 2

3600 RPM = 60 revolutions/s -> 1 rev = 16.66 ms
128 cylinders = 2^7
10% overhead between blocks
time over useful data = 16.66 x 0.9 = 14.99ms
time over gaps = 16.66 x 0.1 = 1.66ms

T1 = time to read one random block
T1 = seek + rotational delay + transfer time 
T1 = 25 + 16.66/2 + 0.117 = 33.45ms

Record Layout

Conventional Indexes

Types:

Dense index: index record appears for every search-key value
Sparse index: contains index records for only some search-key value
Multilevel index: if primary index does not fit in memory, access becomes expensive.
1. Outer index: a sparse index of primary index.
2. Inner index: the primary index file
Duplicate keys:
1. Dense index: only keep record in index for first data record with each search key value.
2. Sparse index: has key-pointer pairs corresponding to the first search key on each block of the data file.

B+ Tree

Characteristics:

full(packed) and balanced(almost uniform height).

Consider a DBMS has following characteristics: Pointers:4 bytes. Keys: 4 bytes. Blocks:100 bytes.
Compute the maximum number of records we can index with:
a).2-level B Tree
b).2-level B+ Tree

a).2-level B Tree
Find largest integer value of n such that 4n+4(2n+1)<100. n=8.
Root has 8 keys + 8 record pointers + 9 children pointers 
= 8 x 4 + 8 x 4 + 9 x 4 = 100bytes.
Each of 9 children has 12 records pointers
= 12 x (4 + 4) + 4 = 100bytes.
So maximum # of records = 12 x 9 + 8 = 116.

b).2-level B+ Tree
Find largest integer value of n such that 4n+(4n+1)<100. n=12.
Root has 12 keys + 13 children pointers
= 12 x 4 + 13 x 4 = 100 bytes.
Each of 13 children has 12 record pointers
= 12 x (4 + 4) + 4 = 100 bytes.
So maximum # of records = 13 x 12 = 156.

B Tree

Hashing

Characteristics:

$$
Avg\ occupancy = \frac{r}{n\times\gamma}
$$

n: The number of buckets that is currently in use.

r: current number of search keys stored in the buckets.

$\gamma$: block size(# search keys that can be stored in 1 block).

Relational Algebra

Algebra expressions / Translate SQL into relational algebra / Equivlaence.

Query Process

parsing: translating the query into a parsing tree.
query rewrite: the parse tree is transformed into an expression tree of relational algebra (logical query plan)
physical plan generation: translate the logical plan into a physical plan
- select algorithms to implement each operator
- choose order of operations.

Physical Query Plan

Sequential-Scan
- Read tuples from the relation R by reading data blocks - one block at a time from disk to memory.
Index-Scan
- The relation R must have an index
- The index is scanned to find blocks that contain the desired tuples.
- All the blocks containing desired tuples are then read into the memory - one block at a time.

Clustered and unclustered files/relations/indexes

Clustered file: A file that stores records of one relation.
Unclustered file: A file that stores records of multiple relations.
Clustered index: Index that allows tuples to be read in an order that corresponds to physical order.
Relation R is clustered.

Operator	I/O cost	Explanation
Sequential-Scan	B(R)	Read all blocks, and there are B(R) blocks.
Index-Scan	≤B(R)	Depends on number of values in the index is scanned.

Relation R is unclustered.

Operator	I/O cost	Explanation
Sequential-Scan	~T(R)	Assuming next tuple is not found in the current block. We will read 1 block per tuple
Index-Scan	≤T(R)	Depends on number of values in the index is scanned.

SQL grammer

<Query> ::= SELECT <SelList>
						FROM   <From List>
						WHERE  <Condition>
						
<SelList> ::= <Attribute> | <Attribute> , <SelList>

<FromList> ::= <Relation> | <Relation> , <FromList>

<Condition> ::= <Condition> AND <Condition>  |
								<Attribute> IN (<Query>)     |
								<Attribute> = <Attribute>
								<Attribute> LIKE <Pattern>

Example of parse trees

SELECT movieTitle
FROM   StarsIn, MovieStar
WHERE  starName = name
       AND birthdate LIKE '%1960'

Example 2 of parse trees

SELECT movieTitle
FROM   StarsIn
WHERE  starName IN (SELECT name
                    FROM   MovieStar
                    WHERE  birthdate LIKE '%1960')

example2OfParseTrees

Virtual Relation Pre-Processing

CREATE VIEW Paramount_Movies AS
	(SELECT title, year
	FROM   Movies
	WHERE  StudioName = 'Paramount')

Consider the following query on the virtual table Paramount_Movies:

1
2
3

SELECT	title
FROM		Paramount_Movies
WHERE		year = 1979

Then, the virtual table is replaced by the corresponding logical query plan

Convert Parse Tree into initial L.Q.P

convertParseTreeIntoInitialLQP

Set vs Bag semantics

Set semantics: No duplicate entries.
Bag semantics: Allow duplicate entries.

Operators

Selection
- Syntax: σ c (R)
- Set: { t | t ∈ R AND t fulfills c}
- Bag: { $t^n$ | $t^n$ ∈ R AND t fulfills c}
Renaming
- Syntax: ρ A (R)
- Set: { t.A | t∈R}
- Bag: { $(t.A)^n$ | $t^n$ ∈R}
- Example: $ρ_c$←a(R)
Projection
- Syntax: π A (R)
- Set: { t.A | t∈R}
- Bag: { $(t.A)^n$ | $t^n$ ∈R}
Joins
- Theta, natural, cross-product, outer, anti
- Cross Product
  - Syntax: R × S
  - Set: { (t,s) | t∈R AND s∈S}
  - Bag: { (t,s)^n×m | t^n ∈R AND s^m ∈S}
- Join
  - Syntax: $R\bowtie_{c}S$
  - Set: { (t,s) | t∈R AND s∈S AND (t,s) matches c}
  - Bag: { $(t,s)^{n×m}$ |$t^n$ ∈R AND $s^m$ ∈S AND (t,s) matches c}
- Natural Join
  - Syntax: $R\bowtie S$
  - All combinations of tuples from R and S that match on common attributes
  - A = common attributes of R and S
  - C = exclusive attributes of S
  - Set: {(t,s.C) | t∈R AND s∈S AND t.A=s.A}
  - Bag: {$(t,s.C)^{n×m}$ | $t^n$ ∈R AND $s^m$ ∈S AND t.A=s.A}
- Left-outer Join
  - Syntax:$R=\bowtie{c} S$
  - Contains all tuples in R join S
  - Also includes every tuple in R that is not joined with a tuple in S, after padding a NULL
  - $R=\bowtie _{a=d}S$
- Right-outer Join
  - Syntax:$R\bowtie =S$
  - s∈S without match, ﬁll R attributes with NULLrigh
  - $R\bowtie =_{a=d}S$
- Full-outer Join
  - Syntax: $R=\bowtie =_{c}S$
  - $R=\bowtie=_{a=d}S$
Aggregation
- All operators treat a relation as a bag of tuples.
- SUM: computes the sum of a column with numerical values.
- AVG: computes the average of a column with numerical values.
- MIN and MAX: Natural Order
- COUNT: computes the number of non-NULL tuples in a column.
- Grouping and aggregation generally need to be implemented and optimized together
- Syntax: $G\gamma A(R)$
  - A is list of aggregation functions
  - G is list of group by attributes
- Build groups of tuples according G and compute the aggregation functions from each group
- $b\gamma sum(a)(R)$
Duplicate removal
- Syntax: $\delta(R)$: is the relation containing exactly one copy of each tuple in R.
- $\delta(R)$
Set operations.
Union, Intersection and Difference
- Union
  - Syntax: $R\cup S$
- Intersection
  - Syntax: $R\cap S$
- Set Difference
  - Syntax: $R-S$

Canonical Translation to Relational Algebra

FROM clause into joins and crossproducts
WHERE clause into selection
SELECT clause into projection and renaming
- If it has aggregation functions use aggregation
- DISTINCT into duplicate removal
GROUP BY clause into aggregation
HAVING clause into selection
ORDER BY - no counter-part

Set Operations

UNION ALL into union
UNION duplicate removal over union
INTERSECT ALL into intersection
INTERSECT add duplicate removal
EXCEPT ALL into set diﬀerence
EXCEPT apply duplicate removal to inputs and then apply set diﬀerence

Conversion into Relational Algebra

Example 1

1	SELECT name FROM MovieStar WHERE birthdate LIKE '%1960';

Example 2

1
2
3

SELECT movieTitle
FROM   StarsIn, MovieStar
WHERE  starName = name AND birthdate LIKE '%1960';

Example 3

1
2
3

SELECT sum(a)
FROM R
GROUP BY b

Example 4

SELECT dep, headcnt
FROM (SELECT count(*) AS headcnt, dep
     FROM Employee
     GROUP BY dep)
WHERE headcnt > 100

Example 5

1	SELECT title FROM StarsIn WHERE starName IN (<Query>);

conversion_into_relationExample5

Query Processing and Optimization

Query Optimization

Relational Algebra Optimization

Algebraic Laws:

Joins and Products
- Natural joins and product are both associative and commutative
- theta-join is commutative but not always associative.
Union and Intersect
- Union and intersection are both associative and commutative
Laws for Bags and Sets can differ
- Holds for sets, but guarentee for bags
Select
- reduces the number of tuples (size of relation)
- an important general rule in optimization is to push selects as far down the tree as possible
- commutative
- must be pushed through both arguments for union:
  - Union: σ p (R ∪ S) = σ p (R) ∪ σ p (S)
- must be pushed through ﬁrst arguments, optionally second for diﬀerence:
  - Diﬀerence: σ p (R - S) = σ p (R) - S
- For the other operators it is only required that the selection be pushed to one argument.
  - σ p (R × S) = σ p (R) × S, if p contains only attributes from R
- Derivation
- Amendment to the simple push down the select
  - Use this algebraic law in the reverse order: $\sigma _p(R\bowtie S) = \sigma _p(R)\bowtie S$ to push the σ year=1996 operation up the tree
  - Then use the algebraic law in the forward order: $\sigma _p(R\bowtie S) = \sigma _p(R)\bowtie \sigma _p(S)$ push the $\sigma _{year=1996}$ operation down the tree.
Projection
Example
Summary
- Selections
  - push down tree as far as possible
  - if condition is an AND, split and push separately
  - sometimes need to push up before pushing down
- Projections
  - can be pushed down
  - new ones can be added (but be careful)
- duplicate eliminations can sometimes be removed (e.g., if one key)
- Selection/product combinations
  - can sometimes be replaced with join
- Many transformations lead to “promising” plans