
Kn65. Donald E. Knuth. On the Translation of Languages from Left to Right

LitD:Kn65.pdf, Kn65↗ ⊥
Donald E. Knuth
1965
On the Translation of Languages from Left to Right
INFORMATION AND CONTROL 8, 607-639 (1965)
my python implementation

LR(k) grammars, parsing, and fundamental properties

LR(k)

L = Left to right parsing
R = Rightmost derivation in reverse (expansion of rules, tree linearization: always expand the rightmost nonterminal)
k = number of lookahead terminals needed for unique parsing

i. Intro and Defs

"intermediates" I (Rules): A, B, C
"terminals" T: a, b, c
S denotes the "principal intermediate character"
ε empty string
production is a relation A → Θ, where Θ is a string over I ∪ T
Grammar G: {A_p → X_p1 ... X_pnp | rule number p ∈ [1,s], n_p >= 0 }
φ → ψ direct derivation: replace one intermediate in φ by the right-hand side of one of its productions
φ ⇒ ψ transitive closure of →
φ ⇛ ψ reflexive and transitive closure of →, i.e. φ ⇒ ψ or φ = ψ
language of a grammar: {α string over T | S ⇒ α}
sentential form: any string α for which S ⇒ α
derivation tree or parse diagram
- linearizations: expand one intermediate after the other; the order does not change the tree, so we may fix one, e.g. always the leftmost or always the rightmost intermediate
"handle of a tree": the leftmost set of adjacent leaves forming a complete branch
"pruning off" handles (i.e. removing the leaves of the handle) ==> step by step back to S <==> rightmost derivation in reverse
(n, p) is "handle of α = X₁ ... X_n ... X_t (string over T ∪ I)" iff ∃ derivation tree of α with the handle X_r+1 ... X_n for production p (so r = n - n_p)
"k-sentential form" is a sentential form followed by k ⊣ with ⊣ ∉ T ∪ I
a grammar is "LR(k)" iff for any k-sentential form
- the handle is always uniquely determined by the string to its left and the k terminal characters to its right. Formally: iff for all α, β with
- α = X₁ ... X_n ... X_n+k Y₁ ... Y_u is a k-sentential form without Intermediates from X_n+1 to Y_u
- β = X₁ ... X_n ... X_n+k Z₁ ... Z_v is a k-sentential form without Intermediates from X_n+1 to Z_v
- (n, p) handle of α and (n', p') handle of β
- then n = n' and p = p'
- this means we do all possible reductions at n, and then step one forward - no backtracking needed
- remarks
  - "handle" implies that there are no reductions left of n; however, we have to consider all possible trees
  - the pushdown automaton reduces the infinite number of possible prefixes (n is unbounded) to a finite number of combinations
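
The handle and pruning definitions above can be illustrated with a small sketch. The toy grammar (rule 1: S → a S b, rule 2: S → c) and the names `GRAMMAR`, `rightmost_step` and `handles` are assumptions made for this example, not taken from the paper. The sketch replays a rightmost derivation and then lists the sentential forms in reverse order, each with its handle (n, p).

```python
# Hypothetical toy grammar: rule number p -> (A_p, (X_p1, ..., X_pnp))
GRAMMAR = {
    1: ("S", ("a", "S", "b")),
    2: ("S", ("c",)),
}
INTERMEDIATES = {lhs for lhs, _ in GRAMMAR.values()}


def rightmost_step(form, p):
    """Expand the rightmost intermediate of `form` with rule p.

    Returns the new sentential form and n, the 1-based position of the last
    symbol produced by rule p, so that (n, p) is the handle of the new form.
    """
    lhs, rhs = GRAMMAR[p]
    positions = [i for i, x in enumerate(form) if x in INTERMEDIATES]
    if not positions or form[positions[-1]] != lhs:
        raise ValueError("rule %d does not apply to the rightmost intermediate" % p)
    i = positions[-1]
    return form[:i] + rhs + form[i + 1:], i + len(rhs)


def handles(rule_numbers, start="S"):
    """Replay a rightmost derivation, then return (form, handle) pairs in
    reverse order, i.e. in the order in which a parser prunes the handles."""
    form = (start,)
    steps = []
    for p in rule_numbers:
        form, n = rightmost_step(form, p)
        steps.append((form, (n, p)))
    return list(reversed(steps))


if __name__ == "__main__":
    for form, handle in handles([1, 1, 2]):
        print(" ".join(form), handle)
```

For the derivation S → aSb → aaSbb → aacbb this lists aacbb with handle (3, 2), then aaSbb with handle (4, 1), then aSb with handle (3, 1).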

ii. ANALYSIS OF LR(k) GRAMMARS

method 1: reduce to a regular grammar

define Grammar G' from G by

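Terminals: T ∪ I ∪ {⊣} ∪ {[p]} (the markers [p] have no productions of their own and only end a string, so they act as terminals)
Intermediates: [A, α] with A ∈ I and α ∈ (T ∪ {⊣})^k
H_k(σ) = {α ∈ (T ∪ {⊣})^k | ∃ β: σ ⇛ αβ}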
rules
- [A_p, α] → X_p1 ... X_p(j-1) [X_pj, β] with 1 <= j <= n_p, X_pj ∈ I and β ∈ H_k(X_p(j+1) ... X_pnp α)
- [A_p, α] → X_p1 ... X_pnp α [p]
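
A minimal sketch of how H_k could be computed, assuming that every intermediate derives at least one terminal string. It uses the usual fixpoint iteration over k-truncated prefixes. The function names, the grammar encoding (rule number → (left side, right side)) and the use of "$" in place of ⊣ are assumptions of this example.

```python
END = "$"  # stands in for the end marker ⊣


def concat_k(k, sets):
    """Concatenate sets of terminal tuples, truncating each result to length k."""
    result = {()}
    for s in sets:
        result = {(u + v)[:k] for u in result for v in s}
    return result


def first_sets(grammar, k):
    """F[A] = k-truncated prefixes of the terminal strings derivable from A."""
    intermediates = {lhs for lhs, _ in grammar.values()}
    F = {A: set() for A in intermediates}
    changed = True
    while changed:  # iterate until no set grows any more
        changed = False
        for lhs, rhs in grammar.values():
            parts = [F[x] if x in intermediates else {(x,)} for x in rhs]
            new = concat_k(k, parts)
            if not new <= F[lhs]:
                F[lhs] |= new
                changed = True
    return F


def H(grammar, k, sigma):
    """All terminal strings of length exactly k that can start a string derived from sigma."""
    F = first_sets(grammar, k)
    parts = [F[x] if x in F else {(x,)} for x in sigma]
    return {w for w in concat_k(k, parts) if len(w) == k}


if __name__ == "__main__":
    G = {1: ("S", ("a", "S", "b")), 2: ("S", ("c",))}
    print(H(G, 1, ("S", "b", END)))
    print(H(G, 2, ("S", END, END)))
```

In the rules above σ always ends with the k-letter string α, so every derived terminal string has at least k letters and the length filter does not drop anything relevant.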

now, the following two are equivalent

[S, ⊣^k] ⇒ X₁ ... X_n ... X_n+k [p] in G'
there exists a k-sentential form of G: X₁ ... X_n ... X_n+k Y₁ ... Y_u with handle (n, p) and X_n+1 ... Y_u not intermediates

easy to see for Knuth; I don't see how to prove the leftmost property of the handle (maybe it is not necessary because it follows from the second property below?)

thus

G is LR(k) if and only if
[S, ⊣^k] ⇒ θ [p] and [S, ⊣^k] ⇒ θ φ [q] implies φ = ε and p = q in G'

but G' is regular, and well-known methods exist to check this
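
A sketch of method 1 restricted to k = 0, where H_0 is trivially {ε} and the lookahead component of [A, α] disappears. It builds the productions of G' and then runs a subset construction over this right-linear grammar. The condition above is violated exactly when some reachable state can read an end marker [p] and also something else. The encoding and the names `g_prime_k0` and `is_lr0` are assumptions of this example. It starts from [S] as in the text, without an extra rule 0, and treats every symbol that has no production as a terminal.

```python
def g_prime_k0(grammar):
    """Productions of G' for k = 0 as (lhs, prefix, tail) triples.

    tail is ("I", B) for a trailing intermediate [B] or ("P", p) for the marker [p].
    """
    intermediates = {lhs for lhs, _ in grammar.values()}
    rules = []
    for p, (lhs, rhs) in grammar.items():
        for j, x in enumerate(rhs):
            if x in intermediates:
                rules.append((lhs, rhs[:j], ("I", x)))  # [A_p] -> X_p1 ... X_p(j-1) [X_pj]
        rules.append((lhs, rhs, ("P", p)))              # [A_p] -> X_p1 ... X_pnp [p]
    return rules


def is_lr0(grammar, start="S"):
    rules = g_prime_k0(grammar)

    def closure(configs):
        # A config is (symbols still to read, tail); expand trailing intermediates.
        seen, todo = set(configs), list(configs)
        while todo:
            rest, (kind, val) = todo.pop()
            if not rest and kind == "I":
                for lhs, prefix, tail in rules:
                    if lhs == val and (prefix, tail) not in seen:
                        seen.add((prefix, tail))
                        todo.append((prefix, tail))
        return frozenset(seen)

    start_state = closure({((), ("I", start))})
    states, todo = {start_state}, [start_state]
    while todo:
        state = todo.pop()
        markers = {val for rest, (kind, val) in state if not rest and kind == "P"}
        symbols = {rest[0] for rest, _ in state if rest}
        # theta [p] and theta phi [q] with phi != eps or p != q: not LR(0)
        if markers and (len(markers) > 1 or symbols):
            return False
        for x in symbols:
            nxt = closure({(rest[1:], tail) for rest, tail in state if rest and rest[0] == x})
            if nxt not in states:
                states.add(nxt)
                todo.append(nxt)
    return True


if __name__ == "__main__":
    print(is_lr0({1: ("S", ("a", "S", "b")), 2: ("S", ("c",))}))
    print(is_lr0({1: ("S", ("a", "S")), 2: ("S", ("a",))}))
```

The first grammar passes the check. The second one fails it: after reading a, the marker [2] is possible, but so are further symbols.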

method 2: LR(k) pushdown parser

careful with empty productions A → ε! modify H_k(σ) to omit all derivations in which an intermediate at the initial position is replaced by ε.

state = set of [p, j, α] meaning we have parsed β X_p1 ... X_pj and there is a sentential form β A_p α ...

stack: S₀X₁S₁X₂S₂ ... X_nS_n | Y₁ ... Y_k ω: left of the | are the already parsed characters and states, to the right follow the lookahead and the rest of the input

algorithm on stack as above:

step1 compute recursively the closure S' of S_n: S' = S_n ∪ {[q, 0; β] | ∃ [p, j; α] in S', j < n_p, X_p(j+1) = A_q and β ∈ H_k(X_p(j+2) ... X_pnp α)} : add all productions that could newly start
step2 compute
- Z = {β | ∃ [p, j; α] in S', j < n_p and β ∈ H_k(X_p(j+1) ... X_pnp α)} : in rule p, but not at end
- Z_p = {α | ∃ [p, n_p; α] in S'} : at end of rule p
- if these sets are not disjoint, error: grammar is not LR(k)
- if lookahead Y₁ ... Y_k in Z then shift Y₁ onto stack ==> S₀X₁S₁X₂S₂ ... X_nS_nY₁ | Y₂...
- elif lookahead in Z_p then reduce stack by rule p ==> S₀X₁S₁X₂S₂ ... X_n-npS_n-npA_p | Y₁ ...
- else syntax error
step3 after renaming the stack has now the form S₀X₁S₁X₂S₂ ... X_nS_nX_n+1 | Y₁ ... Y_k ω
- compute S' from S_n as in step1 (add all productions that could newly start)
- compute S_n+1 = {[p, j+1; α] | [p, j; α] ∈ S', j < n_p and X_p(j+1) = X_n+1}: new state after X_n+1
- if S_n+1 = {[0, 1; ⊣^k]} and lookahead = ⊣^k then parsing finished successfully
- else push S_n+1 on the stack and go to step1
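
Steps 1 to 3 can be sketched as follows. This is an illustration under stated assumptions, not the implementation linked above: the grammar has no empty productions (so plain H_k is enough), every intermediate derives a terminal string, rule 0 is taken to be S0 → S ⊣^k with initial state {[0, 0; ⊣^k]}, "$" stands in for ⊣, and all names are hypothetical. The termination test checks that [0, 1; ⊣^k] is a member of S_n+1 rather than the only element, so that a start symbol occurring on a right-hand side still terminates.

```python
END = "$"  # stands in for the end marker ⊣


def concat_k(k, sets):
    """Concatenate sets of terminal tuples, truncating each result to length k."""
    result = {()}
    for s in sets:
        result = {(u + v)[:k] for u in result for v in s}
    return result


def parse(grammar, k, tokens, start="S"):
    """Return the rule numbers of the reductions in the order they are performed."""
    if any(not rhs for _, rhs in grammar.values()):
        raise ValueError("this sketch assumes a grammar without empty productions")
    rules = dict(grammar)
    rules[0] = ("S0", (start,) + (END,) * k)  # assumed rule 0: S0 -> S ⊣^k
    inter = {lhs for lhs, _ in rules.values()}

    # F[A]: k-truncated prefixes of terminal strings derivable from A (fixpoint).
    F = {A: set() for A in inter}
    changed = True
    while changed:
        changed = False
        for lhs, rhs in rules.values():
            new = concat_k(k, [F[x] if x in inter else {(x,)} for x in rhs])
            if not new <= F[lhs]:
                F[lhs] |= new
                changed = True

    def H(sigma):
        parts = [F[x] if x in inter else {(x,)} for x in sigma]
        return {w for w in concat_k(k, parts) if len(w) == k}

    def closure(state):  # step 1: add all productions that could newly start
        items, todo = set(state), list(state)
        while todo:
            p, j, a = todo.pop()
            rhs = rules[p][1]
            if j < len(rhs) and rhs[j] in inter:
                for q in rules:
                    if rules[q][0] != rhs[j]:
                        continue
                    for b in H(rhs[j + 1:] + a):
                        if (q, 0, b) not in items:
                            items.add((q, 0, b))
                            todo.append((q, 0, b))
        return items

    stack = [frozenset({(0, 0, (END,) * k)})]  # S_0
    rest = list(tokens) + [END] * k
    reductions = []
    while True:
        items = closure(stack[-1])
        look = tuple(rest[:k])
        Z, Zp = set(), {}  # step 2
        for p, j, a in items:
            rhs = rules[p][1]
            if j < len(rhs):
                Z |= H(rhs[j:] + a)             # inside rule p, not at its end
            else:
                Zp.setdefault(p, set()).add(a)  # at the end of rule p
        seen = set(Z)
        for zs in Zp.values():
            if zs & seen:
                raise ValueError("grammar is not LR(%d)" % k)
            seen |= zs
        if look in Z and rest:
            x = rest.pop(0)                     # shift Y_1
        else:
            p = next((p for p, zs in Zp.items() if look in zs), None)
            if p is None:
                raise SyntaxError("unexpected input at %r" % (rest[:k],))
            lhs, rhs = rules[p]
            del stack[len(stack) - 2 * len(rhs):]  # pop n_p symbols and states
            x = lhs
            reductions.append(p)
        items = closure(stack[-1])              # step 3
        new_state = frozenset((p, j + 1, a) for p, j, a in items
                              if j < len(rules[p][1]) and rules[p][1][j] == x)
        if (0, 1, (END,) * k) in new_state and rest == [END] * k:
            return reductions
        stack += [x, new_state]


if __name__ == "__main__":
    G = {1: ("S", ("a", "S", "b")), 2: ("S", ("c",))}
    print(parse(G, 1, "acb"))
```

For the toy grammar S → a S b | c and input acb the reductions come out as rule 2 followed by rule 1, which is the rightmost derivation in reverse. The sets Z and Z_p are recomputed for every state on the fly; a table-driven parser would precompute them once per state.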