Simplified excerpt of the formal grammar[1] for the C programming language (left), and a derivation of a piece of C code (right) from a nonterminal symbol. Nonterminal and terminal symbols are shown in blue and red, respectively.

In formal language theory, a context-free grammar (CFG) is a formal grammar whose production rules are of the form

A → α

with A a single nonterminal symbol and α a string of terminals and/or nonterminals (α can be empty). A formal grammar is "context-free" if its production rules can be applied regardless of the context of a nonterminal: no matter which symbols surround it, the single nonterminal on the left-hand side can always be replaced by the right-hand side. This is what distinguishes it from a context-sensitive grammar.

A formal grammar is essentially a set of production rules that describe all possible strings in a given formal language. Production rules are simple replacements. For example, the first rule in the picture replaces the nonterminal on its left-hand side with the string on its right-hand side. There can be multiple replacement rules for a given nonterminal symbol. The language generated by a grammar is the set of all strings of terminal symbols that can be derived, by repeated rule applications, from some particular nonterminal symbol (the "start symbol"). Nonterminal symbols are used during the derivation process, but do not appear in its final result string.

The languages generated by context-free grammars are known as context-free languages (CFL). Different context-free grammars can generate the same context-free language. It is important to distinguish the properties of the language (intrinsic properties) from the properties of a particular grammar (extrinsic properties). The language equality question (do two given context-free grammars generate the same language?) is undecidable.

Context-free grammars arise in linguistics, where they are used to describe the structure of sentences and words in a natural language, and they were indeed invented by the linguist Noam Chomsky for this purpose. Conversely, in computer science, as the use of recursively defined concepts increased, they were used more and more. In an early application, grammars are used to describe the structure of programming languages. In a newer application, they are used in an essential part of the Extensible Markup Language (XML) called the Document Type Definition.[2]

In linguistics, some authors use the term phrase structure grammar to refer to context-free grammars, whereby phrase-structure grammars are distinct from dependency grammars. In computer science, a popular notation for context-free grammars is Backus–Naur form, or BNF.

Background

Since at least the time of Pāṇini, linguists have described the grammars of languages in terms of their block structure, and described how sentences are recursively built up from smaller phrases, and eventually individual words or word elements. An essential property of these block structures is that logical units never overlap. For example, the sentence:

John, whose blue car was in the garage, walked to the grocery store.

can be logically parenthesized (with the logical metasymbols [ ]) as follows:

[John[, [whose [blue car]] [was [in [the garage]]],]] [walked [to [the [grocery store]]]].

A context-free grammar provides a simple and mathematically precise mechanism for describing the methods by which phrases in some natural language are built from smaller blocks, capturing the "block structure" of sentences in a natural way. Its simplicity makes the formalism amenable to rigorous mathematical study. Important features of natural language syntax such as agreement and reference are not part of the context-free grammar, but the basic recursive structure of sentences, the way in which clauses nest inside other clauses, and the way in which lists of adjectives and adverbs are swallowed by nouns and verbs, is described exactly.

Context-free grammars are a special form of semi-Thue systems that in their general form date back to the work of Axel Thue.

The formalism of context-free grammars was developed in the mid-1950s by Noam Chomsky,[3] and also their classification as a special type of formal grammar (which he called phrase-structure grammars).[4] What Chomsky called a phrase structure grammar is also now known as a constituency grammar, whereby constituency grammars stand in contrast to dependency grammars. In Chomsky's generative grammar framework, the syntax of natural language was described by context-free rules combined with transformation rules.

Block structure was introduced into computer programming languages by the Algol project (1957–1960), which, as a consequence, also featured a context-free grammar to describe the resulting Algol syntax. This became a standard feature of computer languages, and the notation for grammars used in concrete descriptions of computer languages came to be known as Backus–Naur form, after two members of the Algol language design committee.[3] The "block structure" aspect that context-free grammars capture is so fundamental to grammar that the terms syntax and grammar are often identified with context-free grammar rules, especially in computer science. Formal constraints not captured by the grammar are then considered to be part of the "semantics" of the language.

Context-free grammars are simple enough to allow the construction of efficient parsing algorithms that, for a given string, determine whether and how it can be generated from the grammar. An Earley parser is an example of such an algorithm, while LR and LL parsers are simpler algorithms that deal only with more restrictive subsets of context-free grammars.

Formal definitions

A context-free grammar G is defined by the 4-tuple G = (V, Σ, R, S), where[5]

  1. V is a finite set; each element v ∈ V is called a nonterminal character or a variable. Each variable represents a different type of phrase or clause in the sentence. Variables are also sometimes called syntactic categories. Each variable defines a sub-language of the language defined by G.
  2. Σ is a finite set of terminals, disjoint from V, which make up the actual content of the sentence. The set of terminals is the alphabet of the language defined by the grammar G.
  3. R is a finite relation in V × (V ∪ Σ)*, where the asterisk represents the Kleene star operation. The members of R are called the (rewrite) rules or productions of the grammar (also commonly symbolized by a P).
  4. S is the start variable (or start symbol), used to represent the whole sentence (or program). It must be an element of V.
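
This 4-tuple definition can be transcribed directly into code. The following is a minimal Python sketch (the class and field names are illustrative, not from any standard library), instantiated with the grammar of the "Words concatenated with their reverse" example below:

from dataclasses import dataclass

@dataclass(frozen=True)
class CFG:
    variables: frozenset    # V: the nonterminal symbols
    terminals: frozenset    # Σ: the terminal symbols, disjoint from V
    rules: frozenset        # R ⊆ V × (V ∪ Σ)*, as (lhs, rhs-tuple) pairs
    start: str              # S ∈ V, the start symbol

# The grammar S → aSa | bSb | ε of the "Words concatenated with their
# reverse" example below:
g = CFG(
    variables=frozenset({"S"}),
    terminals=frozenset({"a", "b"}),
    rules=frozenset({
        ("S", ("a", "S", "a")),
        ("S", ("b", "S", "b")),
        ("S", ()),          # ε-production: empty right-hand side
    }),
    start="S",
)
assert g.start in g.variables and g.variables.isdisjoint(g.terminals)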

Production rule notation

A production rule in R is formalized mathematically as a pair (α, β) ∈ R, where α ∈ V is a nonterminal and β ∈ (V ∪ Σ)* is a string of variables and/or terminals; rather than using ordered pair notation, production rules are usually written using an arrow operator with α as its left-hand side and β as its right-hand side: α → β.

It is allowed for β to be the empty string, and in this case it is customary to denote it by ε. The form α → ε is called an ε-production.[6]

It is common to list all right-hand sides for the same left-hand side on the same line, using | (the pipe symbol) to separate them. The rules α → β1 and α → β2 can hence be written as α → β1 | β2. In this case, β1 and β2 are called the first and second alternative, respectively.

Rule application

For any strings u, v ∈ (V ∪ Σ)*, we say u directly yields v, written as u ⇒ v, if there is a rule (α, β) ∈ R with α ∈ V and strings u1, u2 ∈ (V ∪ Σ)* such that u = u1αu2 and v = u1βu2. Thus, v is a result of applying the rule (α, β) to u.

Repetitive rule application

For any strings u, v ∈ (V ∪ Σ)*, we say u yields v, or v is derived from u, if there is a positive integer k and strings u1, …, uk ∈ (V ∪ Σ)* such that u = u1 ⇒ u2 ⇒ ⋯ ⇒ uk = v. This relation is denoted u ⇒* v, or u ⇒⇒ v in some textbooks. If k ≥ 2, the relation u ⇒+ v holds. In other words, (⇒*) and (⇒+) are the reflexive transitive closure (allowing a string to yield itself) and the transitive closure (requiring at least one step) of (⇒), respectively.
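
Both relations can be computed mechanically. The sketch below (function names are illustrative) enumerates every v with u ⇒ v by rewriting one nonterminal occurrence, and decides u ⇒* v by a bounded breadth-first search; the bound is needed because the set of derivable strings is in general infinite. Strings are represented as tuples of symbols, and rules as the (lhs, rhs) pairs of the sketch in the Formal definitions section.

def step(rules, u):
    """All strings v with u ⇒ v: rewrite one nonterminal occurrence in u."""
    for i, symbol in enumerate(u):
        for lhs, rhs in rules:
            if symbol == lhs:
                yield u[:i] + rhs + u[i + 1:]

def derives(rules, u, v, max_len=20):
    """Bounded breadth-first search for u ⇒* v."""
    seen, frontier = {u}, [u]
    while frontier:
        nxt = []
        for x in frontier:
            for w in step(rules, x):
                if w == v:
                    return True
                if len(w) <= max_len and w not in seen:
                    seen.add(w)
                    nxt.append(w)
        frontier = nxt
    return u == v      # the reflexive case: every string yields itself

# S ⇒ aSa ⇒ aaSaa ⇒ aabSbaa ⇒ aabbaa, with S → aSa | bSb | ε:
rules = {("S", ("a", "S", "a")), ("S", ("b", "S", "b")), ("S", ())}
assert derives(rules, ("S",), tuple("aabbaa"))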

Context-free language

The language of a grammar G = (V, Σ, R, S) is the set

L(G) = { w ∈ Σ* : S ⇒* w }

of all terminal-symbol strings derivable from the start symbol.

A language L is said to be a context-free language (CFL) if there exists a CFG G such that L = L(G).

Non-deterministic pushdown automata recognize exactly the context-free languages.

Examples

Words concatenated with their reverse

The grammar G = ({S}, {a, b}, P, S), with productions

S → aSa,
S → bSb,
S → ε,

is context-free. It is not proper since it includes an ε-production. A typical derivation in this grammar is

S ⇒ aSa ⇒ aaSaa ⇒ aabSbaa ⇒ aabbaa.

This makes it clear that L(G) = { ww^rev : w ∈ {a, b}* }, where w^rev denotes the reverse of w. The language is context-free; however, it can be proved that it is not regular.

If the productions

S → a,
S → b,

are added, a context-free grammar for the set of all palindromes over the alphabet { a, b } is obtained.[7]

Well-formed parentheses

The canonical example of a context-free grammar is parenthesis matching, which is representative of the general case. There are two terminal symbols "(" and ")" and one nonterminal symbol S. The production rules are

S → SS,
S → (S),
S → ()

The first rule allows the S symbol to multiply; the second rule allows the S symbol to become enclosed by matching parentheses; and the third rule terminates the recursion.[8]
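
Membership in this language can also be tested without reference to the grammar, using the familiar counter characterization of balanced strings; a minimal sketch:

def well_formed(s: str) -> bool:
    """A string of "(" and ")" is in the language iff the running count of
    open parentheses never goes negative and ends at zero (and the string
    is nonempty, since the grammar has no ε-production)."""
    depth = 0
    for c in s:
        depth += 1 if c == "(" else -1
        if depth < 0:          # a ")" with no matching "("
            return False
    return depth == 0 and len(s) > 0

assert well_formed("(()())")
assert not well_formed(")(")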

Well-formed nested parentheses and square brackets

A second canonical example is two different kinds of matching nested parentheses, described by the productions:

S → SS
S → ()
S → (S)
S → []
S → [S]

with terminal symbols [ ] ( ) and nonterminal S.

The following sequence can be derived in that grammar:

([ [ [ ()() [ ][ ] ] ]([ ]) ])

Matching pairs

In a context-free grammar, we can pair up characters the way we do with brackets. The simplest example:

S → aSb
S → ab

This grammar generates the language { aⁿbⁿ : n ≥ 1 }, which is not regular (according to the pumping lemma for regular languages).
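
A sketch membership test for this language, which a finite automaton cannot perform because recognizing the language requires counting an unbounded number of a's:

def in_anbn(s: str) -> bool:
    """True iff s = aⁿbⁿ for some n ≥ 1."""
    n = len(s) // 2
    return len(s) == 2 * n and n >= 1 and s == "a" * n + "b" * n

assert in_anbn("aaabbb") and not in_anbn("aabbb")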

The special character ε stands for the empty string. By changing the above grammar to

S → aSb
S → ε

we obtain a grammar generating the language { aⁿbⁿ : n ≥ 0 } instead. This differs only in that it contains the empty string while the original grammar did not.

Distinct number of a's and b's

A context-free grammar for the language consisting of all strings over {a,b} containing an unequal number of a's and b's:

S → T | U
T → VaT | VaV | TaV
U → VbU | VbV | UbV
V → aVbV | bVaV | ε

Here, the nonterminal T can generate all strings with more a's than b's, the nonterminal U generates all strings with more b's than a's and the nonterminal V generates all strings with an equal number of a's and b's. Omitting the third alternative in the rules for T and U doesn't restrict the grammar's language.
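
This claim can be spot-checked by brute force: enumerate every terminal string the grammar derives up to a small length (expanding only the leftmost nonterminal, which is sufficient for context-free grammars) and compare against the defining predicate. A sketch, with an illustrative string-based rule encoding:

from itertools import product

RULES = {"S": ["T", "U"],
         "T": ["VaT", "VaV", "TaV"],
         "U": ["VbU", "VbV", "UbV"],
         "V": ["aVbV", "bVaV", ""]}

def generated(max_len):
    """All terminal strings of length ≤ max_len derivable from S."""
    words, seen, todo = set(), {"S"}, ["S"]
    while todo:
        form = todo.pop()
        i = next((i for i, c in enumerate(form) if c in RULES), None)
        if i is None:
            words.add(form)            # no nonterminals left
            continue
        for rhs in RULES[form[i]]:
            new = form[:i] + rhs + form[i + 1:]
            # terminals are never erased, so prune forms with too many
            if sum(c not in RULES for c in new) <= max_len and new not in seen:
                seen.add(new)
                todo.append(new)
    return words

unequal = {"".join(p) for n in range(7) for p in product("ab", repeat=n)
           if p.count("a") != p.count("b")}
assert generated(6) == unequal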

Second block of b's of double size

Another example of a non-regular language is { bⁿaᵐb²ⁿ : n ≥ 0, m ≥ 0 }. It is context-free as it can be generated by the following context-free grammar:

S → bSbb | A
A → aA | ε

First-order logic formulas

The formation rules for the terms and formulas of formal logic fit the definition of context-free grammar, except that the set of symbols may be infinite and there may be more than one start symbol.

Examples of languages that are not context free

In contrast to well-formed nested parentheses and square brackets in the previous section, there is no context-free grammar for generating all sequences of two different types of parentheses, each separately balanced disregarding the other, where the two types need not nest inside one another, for example:

[ ( ] )

or

[ [ [ [(((( ] ] ] ]))))(([ ))(([ ))([ )( ])( ])( ])

The fact that this language is not context-free can be proven using the pumping lemma for context-free languages and a proof by contradiction, observing that all words of the form (ⁿ[ⁿ)ⁿ]ⁿ should belong to the language. This language belongs instead to a more general class and can be described by a conjunctive grammar, which in turn also includes other non-context-free languages, such as the language of all words of the form aⁿbⁿcⁿ.

Regular grammars

Every regular grammar is context-free, but not all context-free grammars are regular.[9] The following context-free grammar, however, is also regular.

S → a
S → aS
S → bS

The terminals here are a and b, while the only nonterminal is S. The language described is all nonempty strings of a's and b's that end in a.

This grammar is regular: no rule has more than one nonterminal in its right-hand side, and each of these nonterminals is at the same end of the right-hand side.

Every regular grammar corresponds directly to a nondeterministic finite automaton, so we know that this is a regular language.

Using pipe symbols, the grammar above can be described more tersely as follows:

S → a | aS | bS
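
The correspondence is direct: each nonterminal becomes an automaton state, each rule A → xB becomes a transition on x from state A to state B, and each rule A → x becomes a transition into an accepting state. A sketch for the grammar above (state names are illustrative):

NFA = {("S", "a"): {"S", "ACCEPT"},   # S → aS and S → a
       ("S", "b"): {"S"}}             # S → bS

def accepts(word: str) -> bool:
    """Simulate the nondeterministic automaton on the word."""
    states = {"S"}
    for c in word:
        states = set().union(*(NFA.get((q, c), set()) for q in states))
    return "ACCEPT" in states

assert accepts("ba") and accepts("abba")
assert not accepts("ab") and not accepts("")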

Derivations and syntax trees

A derivation of a string for a grammar is a sequence of grammar rule applications that transform the start symbol into the string. A derivation proves that the string belongs to the grammar's language.

A derivation is fully determined by giving, for each step:

  • the rule applied in that step
  • the occurrence of its left-hand side to which it is applied

For clarity, the intermediate string is usually given as well.

For instance, with the grammar:

  1. S → S + S
  2. S → 1
  3. S → a

the string

1 + 1 + a

can be derived from the start symbol S with the following derivation:

S
→ S + S (by rule 1. on S)
→ S + S + S (by rule 1. on the second S)
→ 1 + S + S (by rule 2. on the first S)
→ 1 + 1 + S (by rule 2. on the second S)
→ 1 + 1 + a (by rule 3. on the third S)

Often, a strategy is followed that deterministically chooses the next nonterminal to rewrite:

  • in a leftmost derivation, it is always the leftmost nonterminal;
  • in a rightmost derivation, it is always the rightmost nonterminal.

Given such a strategy, a derivation is completely determined by the sequence of rules applied. For instance, one leftmost derivation of the same string is

S
→ S + S (by rule 1 on the leftmost S)
→ 1 + S (by rule 2 on the leftmost S)
→ 1 + S + S (by rule 1 on the leftmost S)
→ 1 + 1 + S (by rule 2 on the leftmost S)
→ 1 + 1 + a (by rule 3 on the leftmost S),

which can be summarized as

rule 1
rule 2
rule 1
rule 2
rule 3.
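
Such a summary is enough to replay the whole derivation. A sketch that reconstructs the leftmost derivation above from the rule sequence 1, 2, 1, 2, 3 (the rule encoding is illustrative):

RULE = {1: ["S", "+", "S"], 2: ["1"], 3: ["a"]}

def leftmost(rule_sequence):
    """Replay a rule sequence as a leftmost derivation: each step rewrites
    the leftmost nonterminal (here the only nonterminal is S)."""
    form = ["S"]
    for r in rule_sequence:
        i = form.index("S")              # position of the leftmost S
        form[i:i + 1] = RULE[r]
        print(f"→ {' '.join(form)} (by rule {r})")
    return " ".join(form)

assert leftmost([1, 2, 1, 2, 3]) == "1 + 1 + a"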

One rightmost derivation is:

S
→ S + S (by rule 1 on the rightmost S)
→ S + S + S (by rule 1 on the rightmost S)
→ S + S + a (by rule 3 on the rightmost S)
→ S + 1 + a (by rule 2 on the rightmost S)
→ 1 + 1 + a (by rule 2 on the rightmost S),

which can be summarized as

rule 1
rule 1
rule 3
rule 2
rule 2.

The distinction between leftmost derivation and rightmost derivation is important because in most parsers the transformation of the input is defined by giving a piece of code for every grammar rule that is executed whenever the rule is applied. Therefore, it is important to know whether the parser determines a leftmost or a rightmost derivation, because this determines the order in which the pieces of code will be executed. See LL parsers and LR parsers for examples.

A derivation also imposes in some sense a hierarchical structure on the string that is derived. For example, if the string "1 + 1 + a" is derived according to the leftmost derivation outlined above, the structure of the string would be:

{{1}S + {{1}S + {a}S}S}S

where {...}S indicates a substring recognized as belonging to S. This hierarchy can also be seen as a tree:

This tree is called a parse tree or "concrete syntax tree" of the string, by contrast with the abstract syntax tree. In this case the presented leftmost and the rightmost derivations define the same parse tree; however, there is another rightmost derivation of the same string

S
→ S + S (by rule 1 on the rightmost S)
→ S + a (by rule 3 on the rightmost S)
→ S + S + a (by rule 1 on the rightmost S)
→ S + 1 + a (by rule 2 on the rightmost S)
→ 1 + 1 + a (by rule 2 on the rightmost S),

which defines a string with a different structure

{{{1}S + {1}S}S + {a}S}S

and a different parse tree:

Note however that both parse trees can be obtained by both leftmost and rightmost derivations. For example, the last tree can be obtained with the leftmost derivation as follows:

S
→ S + S (by rule 1 on the leftmost S)
→ S + S + S (by rule 1 on the leftmost S)
→ 1 + S + S (by rule 2 on the leftmost S)
→ 1 + 1 + S (by rule 2 on the leftmost S)
→ 1 + 1 + a (by rule 3 on the leftmost S).

If a string in the language of the grammar has more than one parse tree, then the grammar is said to be an ambiguous grammar. Such grammars are usually hard to parse because the parser cannot always decide which grammar rule it has to apply. Usually, ambiguity is a feature of the grammar, not the language, and an unambiguous grammar can be found that generates the same context-free language. However, there are certain languages that can only be generated by ambiguous grammars; such languages are called inherently ambiguous languages.

Example: Algebraic expressions

Here is a context-free grammar for syntactically correct infix algebraic expressions in the variables x, y and z:

  1. S → x
  2. S → y
  3. S → z
  4. S → S + S
  5. S → S − S
  6. S → S * S
  7. S → S / S
  8. S → (S)

This grammar can, for example, generate the string

(x + y) * x − z * y / (x + x)

as follows:

S
→ S − S (by rule 5)
→ S * S − S (by rule 6, applied to the leftmost S)
→ S * S − S / S (by rule 7, applied to the rightmost S)
→ (S) * S − S / S (by rule 8, applied to the leftmost S)
→ (S) * S − S / (S) (by rule 8, applied to the rightmost S)
→ (S + S) * S − S / (S) (by rule 4, applied to the leftmost S)
→ (S + S) * S − S * S / (S) (by rule 6, applied to the fourth S)
→ (S + S) * S − S * S / (S + S) (by rule 4, applied to the rightmost S)
→ (x + S) * S − S * S / (S + S) (etc.)
→ (x + y) * S − S * S / (S + S)
→ (x + y) * x − S * S / (S + S)
→ (x + y) * x − z * S / (S + S)
→ (x + y) * x − z * y / (S + S)
→ (x + y) * x − z * y / (x + S)
→ (x + y) * x − z * y / (x + x)

Note that many choices were made underway as to which rewrite was going to be performed next. These choices look quite arbitrary. As a matter of fact, they are, in the sense that the string finally generated is always the same. For example, the second and third rewrites

→ S * S − S (by rule 6, applied to the leftmost S)
→ S * S − S / S (by rule 7, applied to the rightmost S)

could be done in the opposite order:

→ S − S / S (by rule 7, applied to the rightmost S)
→ S * S − S / S (by rule 6, applied to the leftmost S)

Also, many choices were made on which rule to apply to each selected S. Changing the choices made and not only the order they were made in usually affects which terminal string comes out at the end.

Let's look at this in more detail. Consider the parse tree of this derivation:

Starting at the top, step by step, an S in the tree is expanded, until no more unexpanded Ses (nonterminals) remain. Picking a different order of expansion will produce a different derivation, but the same parse tree. The parse tree will only change if we pick a different rule to apply at some position in the tree.

But can a different parse tree still produce the same terminal string, which is (x + y) * x − z * y / (x + x) in this case? Yes, for this particular grammar, this is possible. Grammars with this property are called ambiguous.

For example, x + y * z can be produced with these two different parse trees:

However, the language described by this grammar is not inherently ambiguous: an alternative, unambiguous grammar can be given for the language, for example:

T → x
T → y
T → z
S → S + T
S → S − T
S → S * T
S → S / T
T → (S)
S → T,

once again picking S as the start symbol. This alternative grammar will produce x + y * z with a parse tree similar to the left one above, i.e. implicitly assuming the association (x + y) * z, which does not follow standard order of operations. More elaborate, unambiguous and context-free grammars can be constructed that produce parse trees that obey all desired operator precedence and associativity rules.
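
A sketch recursive-descent parser for this unambiguous grammar: the left-recursive alternatives S → S + T | S − T | S * T | S / T are implemented as a loop, which reproduces the left-associative, equal-precedence grouping the grammar encodes (names and the tuple-based tree shape are illustrative):

def parse(tokens):
    pos = 0
    def peek():
        return tokens[pos] if pos < len(tokens) else None
    def parse_T():
        nonlocal pos
        tok = peek()
        if tok == "(":                     # T → (S)
            pos += 1
            tree = parse_S()
            assert peek() == ")"
            pos += 1
            return tree
        assert tok in ("x", "y", "z")      # T → x | y | z
        pos += 1
        return tok
    def parse_S():
        nonlocal pos
        tree = parse_T()                   # S → T
        while peek() in ("+", "-", "*", "/"):
            op = tokens[pos]; pos += 1     # S → S op T, looped
            tree = (tree, op, parse_T())
        return tree
    tree = parse_S()
    assert pos == len(tokens)
    return tree

# The flat left-associative grouping the grammar assigns to x + y * z:
assert parse(list("x+y*z")) == (("x", "+", "y"), "*", "z")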

Normal forms

Every context-free grammar with no ε-production has an equivalent grammar in Chomsky normal form, and a grammar in Greibach normal form. "Equivalent" here means that the two grammars generate the same language.

The especially simple form of production rules in Chomsky normal form grammars has both theoretical and practical implications. For instance, given a context-free grammar, one can use the Chomsky normal form to construct a polynomial-time algorithm that decides whether a given string is in the language represented by that grammar or not (the CYK algorithm).
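
As a sketch of how Chomsky normal form enables this: in such a grammar every rule is either A → BC or A → a, so a dynamic-programming table over substrings suffices. The grammar encoding below is illustrative:

def cyk(word, unit_rules, pair_rules, start):
    """Decide membership of a nonempty word for a CNF grammar."""
    n = len(word)
    # table[i][l] = set of nonterminals deriving word[i : i + l + 1]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, c in enumerate(word):
        table[i][0] = {A for (A, a) in unit_rules if a == c}
    for l in range(1, n):                  # span length minus one
        for i in range(n - l):             # span start
            for p in range(l):             # split point
                for (A, B, C) in pair_rules:
                    if B in table[i][p] and C in table[i + p + 1][l - p - 1]:
                        table[i][l].add(A)
    return start in table[0][n - 1]

# S → AB | BA, A → a, B → b: the words ab and ba
units = {("A", "a"), ("B", "b")}
pairs = {("S", "A", "B"), ("S", "B", "A")}
assert cyk("ab", units, pairs, "S") and not cyk("aa", units, pairs, "S")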

Closure properties

Context-free languages are closed under the various operations, that is, if the languages K and L are context-free, so is the result of the following operations:

  • union K ∪ L; concatenation K ∘ L; Kleene star L*[10]
  • substitution (in particular homomorphism)[11]
  • inverse homomorphism[12]
  • intersection with a regular language[13]

They are not closed under general intersection (hence neither under complementation) and set difference.[14]

Decidable problems

The following are some decidable problems about context-free grammars.

Parsing

The parsing problem, checking whether a given word belongs to the language given by a context-free grammar, is decidable, using one of the general-purpose parsing algorithms:

  • CYK algorithm (for grammars in Chomsky normal form)
  • Earley parser
  • GLR parser
  • LL parser (only for the proper subclass of LL(k) grammars)

Context-free parsing for Chomsky normal form grammars was shown by Leslie G. Valiant to be reducible to boolean matrix multiplication, thus inheriting its complexity upper bound of O(n^2.3728639).[15][16][note 1] Conversely, Lillian Lee has shown O(n^(3−ε)) boolean matrix multiplication to be reducible to O(n^(3−3ε)) CFG parsing, thus establishing some kind of lower bound for the latter.[17]

Reachability, productiveness, nullability

A nonterminal symbol X is called productive, or generating, if there is a derivation X ⇒* w for some string w of terminal symbols. X is called reachable if there is a derivation S ⇒* αXβ, for some strings α, β of nonterminal and terminal symbols, from the start symbol S. X is called useless if it is unreachable or unproductive. X is called nullable if there is a derivation X ⇒* ε. A rule X → ε is called an ε-production. A derivation X ⇒+ X is called a cycle.

Algorithms are known to eliminate from a given grammar, without changing its generated language,

  • unproductive symbols,[18][note 2]
  • unreachable symbols,[20][21][22]
  • ε-productions, with one possible exception,[note 3][23] and
  • cycles.[note 4]

In particular, an alternative containing a useless nonterminal symbol can be deleted from the right-hand side of a rule. Such rules and alternatives are called useless.[24]
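
All three properties can be computed by standard fixpoint iterations; a minimal sketch over (left-hand side, right-hand side) rule pairs (the encoding is illustrative):

def productive(rules, terminals):
    """Nonterminals that derive some string of terminals (fixpoint)."""
    prod, changed = set(), True
    while changed:
        changed = False
        for lhs, rhs in rules:
            if lhs not in prod and all(s in terminals or s in prod for s in rhs):
                prod.add(lhs)
                changed = True
    return prod

def reachable(rules, variables, start):
    """Nonterminals reachable from the start symbol."""
    reach, todo = {start}, [start]
    while todo:
        a = todo.pop()
        for lhs, rhs in rules:
            if lhs == a:
                for s in rhs:
                    if s in variables and s not in reach:
                        reach.add(s)
                        todo.append(s)
    return reach

def nullable(rules):
    """Nonterminals X with X ⇒* ε: same fixpoint, over 'all of rhs nullable'."""
    null, changed = set(), True
    while changed:
        changed = False
        for lhs, rhs in rules:
            if lhs not in null and all(s in null for s in rhs):
                null.add(lhs)
                changed = True
    return null

# S → Aa, A → ε, B → b: A is nullable, B is unreachable, all are productive.
rules = {("S", ("A", "a")), ("A", ()), ("B", ("b",))}
assert nullable(rules) == {"A"}
assert reachable(rules, {"S", "A", "B"}, "S") == {"S", "A"}
assert productive(rules, {"a", "b"}) == {"S", "A", "B"}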

In the depicted example grammar, the nonterminal D is unreachable, and E is unproductive, while C → C causes a cycle. Hence, omitting the last three rules doesn't change the language generated by the grammar, nor does omitting the alternatives "| Cc | Ee" from the right-hand side of the rule for S.

A context-free grammar is said to be proper if it has neither useless symbols nor ε-productions nor cycles.[25] Combining the above algorithms, every context-free grammar not generating ε can be transformed into a weakly equivalent proper one.

Regularity and LL(k) checks

It is decidable whether a given grammar is a regular grammar,[26] as well as whether it is an LL(k) grammar for a given k≥0.[27]:233 If k is not given, the latter problem is undecidable.[27]:252

Given a context-free language, it is neither decidable whether it is regular,[28] nor whether it is an LL(k) language for a given k.[27]:254

Emptiness and finiteness

There are algorithms to decide whether the language of a given context-free grammar is empty, as well as whether it is finite.[29]

Undecidable problems

Some questions that are undecidable for wider classes of grammars become decidable for context-free grammars; e.g. the emptiness problem (whether the grammar generates any terminal strings at all), is undecidable for context-sensitive grammars, but decidable for context-free grammars.

However, many problems are undecidable even for context-free grammars. Examples are:

Universality

Given a CFG, does it generate the language of all strings over the alphabet of terminal symbols used in its rules?[30][31]

A reduction can be demonstrated to this problem from the well-known undecidable problem of determining whether a Turing machine accepts a particular input (the halting problem). The reduction uses the concept of a computation history, a string describing an entire computation of a Turing machine. A CFG can be constructed that generates all strings that are not accepting computation histories for a particular Turing machine on a particular input, and thus it will generate all strings only if the machine doesn't accept that input.

Language equality

Given two CFGs, do they generate the same language?[31][32]

The undecidability of this problem is a direct consequence of the previous: it is impossible to even decide whether a CFG is equivalent to the trivial CFG defining the language of all strings.

Language inclusion

Given two CFGs, can the first one generate all strings that the second one can generate?[31][32]

If this problem was decidable, then language equality could be decided too: two CFGs G1 and G2 generate the same language if L(G1) is a subset of L(G2) and L(G2) is a subset of L(G1).

Being in a lower or higher level of the Chomsky hierarchy

Using Greibach's theorem, it can be shown that the two following problems are undecidable:

  • Given a context-sensitive grammar, does it describe a context-free language?
  • Given a context-free grammar, does it describe a regular language?[31][32]

Grammar ambiguity

Given a CFG, is it ambiguous?

The undecidability of this problem follows from the fact that if an algorithm to determine ambiguity existed, the Post correspondence problem could be decided, which is known to be undecidable.

Language disjointness

Given two CFGs, is there any string derivable from both grammars?

If this problem was decidable, the undecidable Post correspondence problem could be decided, too: given strings α1, …, αN, β1, …, βN over some alphabet {a1, …, ak}, let the grammar G1 consist of the rule

S → α1Sβ1^rev | … | αNSβN^rev | α1bβ1^rev | … | αNbβN^rev;

where βi^rev denotes the reversed string βi and b doesn't occur among the ai; and let the grammar G2 consist of the rule

T → a1Ta1 | … | akTak | b;

Then the Post problem given by α1, …, αN, β1, …, βN has a solution if and only if L(G1) and L(G2) share a derivable string.

Extensions

An obvious way to extend the context-free grammar formalism is to allow nonterminals to have arguments, the values of which are passed along within the rules. This allows natural language features such as agreement and reference, and programming language analogs such as the correct use and definition of identifiers, to be expressed in a natural way. E.g. we can now easily express that in English sentences, the subject and verb must agree in number. In computer science, examples of this approach include affix grammars, attribute grammars, indexed grammars, and Van Wijngaarden two-level grammars. Similar extensions exist in linguistics.

An extended context-free grammar (or regular right part grammar) is one in which the right-hand side of the production rules is allowed to be a regular expression over the grammar's terminals and nonterminals. Extended context-free grammars describe exactly the context-free languages.[33]

Another extension is to allow additional terminal symbols to appear at the left-hand side of rules, constraining their application. This produces the formalism of context-sensitive grammars.

Subclasses

There are a number of important subclasses of the context-free grammars:

  • LR(k) grammars (also known as deterministic context-free grammars) allow parsing (string recognition) with deterministic pushdown automata (PDA), but they can only describe deterministic context-free languages.
  • Simple LR, Look-Ahead LR grammars are subclasses that allow further simplification of parsing. SLR and LALR are recognized using the same PDA as LR, but with simpler tables, in most cases.
  • LL(k) and LL(*) grammars allow parsing by direct construction of a leftmost derivation as described above, and describe even fewer languages.
  • Simple grammars are a subclass of the LL(1) grammars mostly interesting for its theoretical property that language equality of simple grammars is decidable, while language inclusion is not.
  • Bracketed grammars have the property that the terminal symbols are divided into left and right bracket pairs that always match up in rules.
  • Linear grammars have no rules with more than one nonterminal on the right-hand side.
  • Regular grammars are a subclass of the linear grammars and describe the regular languages, i.e. they correspond to finite automata and regular expressions.

LR parsing extends LL parsing to support a larger range of grammars; in turn, generalized LR parsing extends LR parsing to support arbitrary context-free grammars. On LL grammars and LR grammars, it essentially performs LL parsing and LR parsing, respectively, while on nondeterministic grammars, it is as efficient as can be expected. Although GLR parsing was developed in the 1980s, many new language definitions and parser generators continue to be based on LL, LALR or LR parsing up to the present day.

Linguistic applications

Chomsky initially hoped to overcome the limitations of context-free grammars by adding transformation rules.[4]

Such rules are another standard device in traditional linguistics; e.g. passivization in English. Much of generative grammar has been devoted to finding ways of refining the descriptive mechanisms of phrase-structure grammar and transformation rules such that exactly the kinds of things can be expressed that natural language actually allows. Allowing arbitrary transformations does not meet that goal: they are much too powerful, being Turing complete unless significant restrictions are added (e.g. no transformations that introduce and then rewrite symbols in a context-free fashion).

Chomsky's general position regarding the non-context-freeness of natural language has held up since then,[34] although his specific examples regarding the inadequacy of context-free grammars in terms of their weak generative capacity were later disproved.[35] Gerald Gazdar and Geoffrey Pullum have argued that despite a few non-context-free constructions in natural language (such as cross-serial dependencies in Swiss German[34] and reduplication in Bambara[36]), the vast majority of forms in natural language are indeed context-free.[35]

See also

  • Parsing expression grammar
  • Stochastic context-free grammar
  • Algorithms for context-free grammar generation
  • Pumping lemma for context-free languages

References

  1. ^ Brian W. Kernighan and Dennis M. Ritchie (Apr 1988). The C Programming Language. Prentice Hall Software Series (2nd ed.). Englewood Cliffs/NJ: Prentice Hall. ISBN 0131103628. Here: App.A
  2. ^ Introduction to Automata Theory, Languages, and Computation, John E. Hopcroft, Rajeev Motwani, Jeffrey D. Ullman, Addison Wesley, 2001, p.191
  3. ^ a b Hopcroft & Ullman (1979), p. 106.
  4. ^ a b Chomsky, Noam (Sep 1956), "Three models for the description of language", IEEE Transactions on Information Theory, 2 (3): 113–124, doi:10.1109/TIT.1956.1056813
  5. ^ The notation here is that of Sipser (1997), p. 94. Hopcroft & Ullman (1979) (p. 79) define context-free grammars as 4-tuples in the same way, but with different variable names.
  6. ^ Hopcroft & Ullman (1979), pp. 90–92.
  7. ^ Hopcroft & Ullman (1979), Exercise 4.1a, p. 103.
  8. ^ Hopcroft & Ullman (1979), Exercise 4.1b, p. 103.
  9. ^ Aho, Alfred Vaino; Lam, Monica S.; Sethi, Ravi; Ullman, Jeffrey David (2007). "4.2.7 Context-Free Grammars Versus Regular Expressions" (print). Compilers: Principles, Techniques, & Tools (2nd ed.). Boston, MA USA: Pearson Addison-Wesley. pp. 205–206. ISBN 9780321486813. Every construct that can be described by a regular expression can be described by a [context-free] grammar, but not vice-versa.
  10. ^ Hopcroft & Ullman (1979), p.131, Theorem 6.1
  11. ^ Hopcroft & Ullman (1979), pp.131–132, Theorem 6.2
  12. ^ Hopcroft & Ullman (1979), pp.132–134, Theorem 6.3
  13. ^ Hopcroft & Ullman (1979), pp.135–136, Theorem 6.5
  14. ^ Hopcroft & Ullman (1979), pp.134–135, Theorem 6.4
  15. ^ Leslie Valiant (Jan 1974). General context-free recognition in less than cubic time (Technical report). Carnegie Mellon University. p. 11.
  16. ^ Leslie G. Valiant (1975). "General context-free recognition in less than cubic time". Journal of Computer and System Sciences. 10 (2): 308–315. doi:10.1016/s0022-0000(75)80046-8.
  17. ^ Lillian Lee (2002). "Fast Context-Free Grammar Parsing Requires Fast Boolean Matrix Multiplication" (PDF). J ACM. 49 (1): 1–15. arXiv:cs/0112018. doi:10.1145/505241.505242.
  18. ^ Hopcroft & Ullman (1979), Lemma 4.1, p. 88.
  19. ^ Aiken, A.; Murphy, B. (1991). "Implementing Regular Tree Expressions". ACM Conference on Functional Programming Languages and Computer Architecture. pp. 427–447. CiteSeerX 10.1.1.39.3766.; here: Sect.4
  20. ^ Hopcroft & Ullman (1979), Lemma 4.2, p. 89.
  21. ^ Hopcroft, Motwani & Ullman (2003), Theorem 7.2, Sect.7.1, p.255ff
  22. ^ (PDF) https://www.springer.com/cda/content/document/cda_downloaddocument/9780387202488-c1.pdf?SGWID=0-0-45-466216-p52091986
  23. ^ Hopcroft & Ullman (1979), Theorem 4.3, p. 90.
  24. ^ John E. Hopcroft; Rajeev Motwani; Jeffrey D. Ullman (2003). Introduction to Automata Theory, Languages, and Computation. Addison Wesley.; here: Sect.7.1.1, p.256
  25. ^ Nijholt, Anton (1980), Context-free grammars: covers, normal forms, and parsing, Lecture Notes in Computer Science, 93, Springer, p. 8, ISBN 978-3-540-10245-8, MR 0590047.
  26. ^ This is easy to see from the grammar definitions.
  27. ^ a b c D.J. Rosenkrantz and R.E. Stearns (1970). "Properties of Deterministic Top Down Grammars". Information and Control. 17 (3): 226–256. doi:10.1016/S0019-9958(70)90446-8.
  28. ^ Hopcroft & Ullman (1979), Exercise 8.10a, p. 214. The problem remains undecidable even if the language is produced by a "linear" context-free grammar (i.e., with at most one nonterminal in each rule's right-hand side, cf. Exercise 4.20, p. 105).
  29. ^ Hopcroft & Ullman (1979), pp.137–138, Theorem 6.6
  30. ^ Sipser (1997), Theorem 5.10, p. 181.
  31. ^ a b c d Hopcroft & Ullman (1979), p. 281.
  32. ^ a b c Hazewinkel, Michiel (1994), Encyclopaedia of mathematics: an updated and annotated translation of the Soviet "Mathematical Encyclopaedia", Springer, Vol. IV, p. 56, ISBN 978-1-55608-003-6.
  33. ^ Norvell, Theodore. "A Short Introduction to Regular Expressions and Context-Free Grammars" (PDF). p. 4. Retrieved August 24, 2012.
  34. ^ a b Shieber, Stuart (1985), "Evidence against the context-freeness of natural language" (PDF), Linguistics and Philosophy, 8 (3): 333–343, doi:10.1007/BF00630917.
  35. ^ a b Pullum, Geoffrey K.; Gerald Gazdar (1982), "Natural languages and context-free languages", Linguistics and Philosophy, 4 (4): 471–504, doi:10.1007/BF00360802.
  36. ^ Culy, Christopher (1985), "The Complexity of the Vocabulary of Bambara", Linguistics and Philosophy, 8 (3): 345–351, doi:10.1007/BF00630918.

Notes

  1. ^ In Valiant's papers, O(n^2.81) is given, the then best known upper bound. See Matrix multiplication#Algorithms for efficient matrix multiplication and Coppersmith–Winograd algorithm for bound improvements since then.
  2. ^ For regular tree grammars, Aiken and Murphy give a fixpoint algorithm to detect unproductive nonterminals.[19]
  3. ^ If the grammar can generate ε, a rule S → ε cannot be avoided.
  4. ^ This is a consequence of the unit-production elimination theorem in Hopcroft & Ullman (1979), p.91, Theorem 4.4

Further reading

  • Hopcroft, John E.; Ullman, Jeffrey D. (1979), Introduction to Automata Theory, Languages, and Computation, Addison-Wesley. Chapter 4: Context-Free Grammars, pp. 77–106; Chapter 6: Properties of Context-Free Languages, pp. 125–137.
  • Sipser, Michael (1997), Introduction to the Theory of Computation, PWS Publishing, ISBN 978-0-534-94728-6. Chapter 2: Context-Free Grammars, pp. 91–122; Section 4.1.2: Decidable problems concerning context-free languages, pp. 156–159; Section 5.1.1: Reductions via computation histories: pp. 176–183.
  • J. Berstel, L. Boasson (1990). Jan van Leeuwen (ed.). Context-Free Languages. Handbook of Theoretical Computer Science. B. Elsevier. pp. 59–102.
