GECEG Manual: Syntactic Annotation

1. General Points

Some general points regarding syntactic annotation:
Terminology: "functional label"; the innermost part of the label, e.g. ADT in ADT-ADV-TMP, is the "core" or "root"; "head"; "token" = absolutely everything, i.e. sentence + footer; "sentence"; "footer".
Annotate purely by function principle and its consequences (for an easy way to search by form, see FAQ). Scheme: one head (f1 - to POS), everything else is a function (a dependency grammar kind of idea). Form is seen as redundant because it can be unambiguously determined from the head (e.g. ADT-PP means a PREP head). Form labels always end in a final P for phrase, i.e. form = phrasal category based on the form of the head. They are just a convenient search option and cover only the most important categories, not all categories.
Headedness Principle:
Outside wrapper (...), which contains all syntactic material. Exceptions to the Headedness Principle: FOREIGN has a function but is not a head itself; FOR TO VBN etc.; non-finite markers as more than one head; FOREIGN contains a sequence of FW, all of them "heads"; GAPS / TAG that do not co-occur with a resumptive head are not headed. The same holds for empty categories that are not heads (EC *pro*), to avoid multiple heads on the clause level.
About ambiguities: there are genuine ambiguities and nothing can be done to resolve them. Only one parse is used, namely whichever seemed more likely to the annotator (me). Possible example: Boeth sentence 5 (ihm näher, or , näher, ihm). Just live with it; send a message if you feel an annotation should be different.

2. Syntactic Functions

What are syntactic functions? Things like grammatical functions (an LFG f-structure kind of idea): everything that can be described as a relation towards the head.

The GeCeG annotates clausal grammatical functions by adding a rightward extension to the core label used for nominal grammatical functions. This design choice is called the Annotate by Complexity Principle: Core functional labels without an extension to the right are non-clausal. For example, SBJ indicates a category that functions as a nominal subject; ADT is a non-clausal adjunct, such as a prepositional phrase, adverb phrase or adverbial noun phrase etc. In contrast, core functional labels with an extension to the right indicate embedding of a clausal category. The extension specifies the type of the embedded clause. For example, SBJ-DCL is a category that functions as subject (hence SBJ) expressed as an embedded finite content clause (hence DCL); ADT-NFN is a category that functions as adjunct (hence ADT) but is realized as a non-finite clause (hence NFN) etc.
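The core-plus-extension structure of these labels can be illustrated with a small sketch. It assumes only that label parts are separated by hyphens, as in the examples above; the names FunctionalLabel, split_label and embeds_clause are hypothetical and not part of any GeCeG tooling. The sketch covers the plain case described in this paragraph and does not handle labels that are documented as exceptions elsewhere in the manual (for example AVM-... or APP-...).

```python
from typing import NamedTuple, Tuple


class FunctionalLabel(NamedTuple):
    """A functional label split into its core and its rightward extensions."""
    core: str
    extensions: Tuple[str, ...]


def split_label(label: str) -> FunctionalLabel:
    """Split a label such as 'ADT-NFN' into core 'ADT' and extensions ('NFN',)."""
    core, *extensions = label.split("-")
    return FunctionalLabel(core, tuple(extensions))


def embeds_clause(label: str) -> bool:
    """Annotate by Complexity: a rightward extension marks a clausal category.

    Assumption: the label is an ordinary core label with optional extensions.
    Exceptions such as AVM-PASSIVE are deliberately not modelled here.
    """
    return len(split_label(label).extensions) > 0


if __name__ == "__main__":
    for example in ("SBJ", "SBJ-DCL", "ADT", "ADT-NFN"):
        print(example, split_label(example), embeds_clause(example))
```

A search tool could use such a split to retrieve, for instance, all adjuncts regardless of complexity by matching on the core alone.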

2.1. Top-Level Functions

Top-level functions are categories that always immediately dominate clausal functions, like subjects or complements etc. There are six general types of top-level functions in the GeCeG: Matrix clauses (MAT-...), fragments (FRAG), finite subordinate clauses (SUB), non-finite subordinate clauses (...-NFN), secondary predicates (...-SPR) and parenthetical clauses (PRN-...).
The former two are the only root nodes used in the GeCeG, that is they are start symbols of the syntactic structure to which every other node is ultimately connected by a unique path. However, matrix clauses can also be embedded for example as direct speech complements. The latter four are embedded top-level categories; they always occur more deeply in the syntactic structure. As such they are top-level functions only by virtue of their inner syntax, i.e. the elements they immediately dominate are clausal functions, but they are clausal or nominal elements at the same time because the nodes under which they are embedded are clause-level or nominal-level functions.
Top-level categories are clausal, that is they must immediately dominate at least a head - typically a verb - as well as a subject. They are then called 'complete' and their requirement to contain a predicator as well as its subject is referred to as the Top-Level Completeness Principle. (This is similar to the meaning of the Extended Projection Principle (EPP) in popular usage). Fragments are the only top-level function for which this requirement does not hold. They are used for structures where a subject and head are not both available. They are therefore said to be 'incomplete.'
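The Top-Level Completeness Principle can be read as a simple check on the immediate daughters of a top-level node. The following is a minimal sketch under stated assumptions: the Node class, its is_head flag and the function names are hypothetical illustrations, not the corpus file format, and the caller is assumed to pass only top-level nodes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical tree node: a label, its daughters, and a head flag."""
    label: str                                   # e.g. "MAT-IMP", "SBJ", "VBI"
    children: List["Node"] = field(default_factory=list)
    is_head: bool = False                        # assumed way of marking the head daughter


def is_complete(node: Node) -> bool:
    """True if the node immediately dominates both a head and a subject."""
    has_head = any(child.is_head for child in node.children)
    # SBJ and clausal subjects such as SBJ-DCL share the core label SBJ.
    has_subject = any(child.label.split("-")[0] == "SBJ" for child in node.children)
    return has_head and has_subject


def violates_completeness(top_level_node: Node) -> bool:
    """True for a top-level node other than FRAG that lacks a head or a subject."""
    return top_level_node.label != "FRAG" and not is_complete(top_level_node)


if __name__ == "__main__":
    # An imperative with an empty subject (*imp*) and an imperative verb head (VBI).
    imperative = Node("MAT-IMP", [Node("SBJ", [Node("*imp*")]), Node("VBI", is_head=True)])
    # A fragment consisting of a single clausal element, e.g. a title.
    fragment = Node("FRAG", [Node("ADT")])
    print(violates_completeness(imperative), violates_completeness(fragment))
```

Fragments are exempt by definition, which is why the check returns False for them even though they are incomplete.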

Comparison with other corpora The Top-Level Completeness Principle differentiates the GeCeG to some degree from other corpora, like the YCOE or PPCME. Non-finite clauses, for instance, can occur without the annotation of a subject in the latter but necessarily require the inclusion of a subject function in the former.

The following graphic illustrates these points.
[Figure: Top-Level categories]
In the descriptions below, only the root nodes (MAT-..., and FRAG) are described in more detail.
For finite subordinate clauses, see periphery functions.
For non-finite clauses and secondary predicates, see their respective core functional label (e.g. ADT for ADT-NFN-PRP etc.) under clause-level and nominal-level functions.
For parentheticals, no description is available yet.
For fragmentary utterances within a top-level category, see Disfluencies.
For incomplete but reconstructible clauses, see Gapping under Coordination.

Top-Level Functions Overview List

The GeCeG uses the following 12 top-level categories.

Top-Level Label  Mnemonic
MAT-DCL          declarative matrix clause
MAT-QUE          interrogative matrix clause
MAT-IMP          imperative clause
FRAG             fragment
SUB              finite subordinate clause
DIR-NFN          non-finite complement clause
ADT-NFN          non-finite adjunct clause
ADT-NFN-PRP      purposive non-finite adjunct clause
MOD-NFN          non-finite modifier clause
DIR-SPR          secondary predicate complement (small clause)
ADT-SPR          secondary predicate adjunct
PRN-MAT-DCL      parenthetical declarative clause
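
The twelve labels in this list fall into the six general types named at the start of this section (MAT-..., FRAG, SUB, ...-NFN, ...-SPR, PRN-...). As a rough sketch, the grouping can be derived from the label shape alone. The function name top_level_type and the returned strings are hypothetical; only the label patterns come from the manual.

```python
TOP_LEVEL_LABELS = [
    "MAT-DCL", "MAT-QUE", "MAT-IMP", "FRAG", "SUB", "DIR-NFN",
    "ADT-NFN", "ADT-NFN-PRP", "MOD-NFN", "DIR-SPR", "ADT-SPR", "PRN-MAT-DCL",
]


def top_level_type(label: str) -> str:
    """Map a top-level label to one of the six general types by its shape."""
    parts = label.split("-")
    # PRN is tested before MAT because PRN-MAT-DCL contains both parts.
    if parts[0] == "PRN":
        return "parenthetical clause"
    if parts[0] == "MAT":
        return "matrix clause"
    if label == "FRAG":
        return "fragment"
    if label == "SUB":
        return "finite subordinate clause"
    if "NFN" in parts:
        return "non-finite subordinate clause"
    if "SPR" in parts:
        return "secondary predicate"
    raise ValueError(f"not a top-level label: {label}")


def is_root_label(label: str) -> bool:
    """Only matrix clauses and fragments are used as root nodes."""
    return top_level_type(label) in ("matrix clause", "fragment")


if __name__ == "__main__":
    for label in TOP_LEVEL_LABELS:
        print(f"{label:12} {top_level_type(label)}")
```

Note that is_root_label only says which labels can be roots; a matrix clause may still occur embedded, for example as a direct speech complement.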

Root Node Details

FRAG (fragment)

Fragments are tokens for which there is not enough material to construct a matrix clause. That is, a head and a subject (SBJ) are not both available. Fragments are always root nodes, i.e. they are the highest label dominating the rest of the sentence. Fragments include at least one clausal element or nominal element.
Typically, fragments are used for titles, which often consist of only one phrase.


MAT-DCL (declarative matrix clause)

Declarative main clauses have a grammatical form indicating that their content is a statement. They are the most common sentence type. Their construction requires the availability of at least an immediately dominated head and subject (SBJ). Normally, however, other clausal elements or embedded top-level categories are present as well. Declaratives are typically root nodes, i.e. they contain every other syntactic node of the sentence.


MAT-IMP (imperative clause)

Imperatives have a grammatical form that indicates commands or prohibitions. They are complete clauses, i.e. they dominate at least a subject and a head. Their head is always an imperative verb, VBI. The subject is not normally expressed in second person singular and second person plural imperatives. Instead, the subject dominates an empty category, labelled *imp* (for 'imperative subject'). First person plural imperatives (adhortatives) usually have an overt first person plural subject pronoun (we).

Comparison with other corpora The GeCeG syntactic annotation for imperatives differs from that of the YCOE or the PPCME in that imperatives necessarily include a subject function even if it is not overt, i.e. the empty category *imp*. The GeCeG is similar to the PPCME because both indicate imperatives as a sentence type (IP-IMP in the PPCME) while the YCOE labels them simply as matrix clauses (IP-MAT).


MAT-QUE (interrogative matrix clause)

Interrogative matrix clauses have a grammatical form that indicates that they are a direct question. As for other matrix clauses, their construction requires the availability of at least an immediately dominated head and subject, but normally other clausal elements or embedded top-level categories are present as well. Interrogatives are always root nodes; they contain every other syntactic node of the token.
There are two types of interrogatives: wh questions and yes/no questions. The former type necessarily includes a focused constituent, usually in initial position, which is unified with some other syntactic function, like subjects (who), objects (whom), locative adjuncts (where) etc. This scheme is represented as follows: The generic discourse function, DISC, includes the simplex or complex wh expression and carries a unification marker, TAG, with an index. An empty syntactic function, like subject, SBJ, adjunct, ADT, etc., dominates only a TAG marker with an identical index. This co-indexing mechanism represents unification of the syntactic function with the focused constituent. The empty category is always attached as high as possible after the question focus. It is usually found in the matrix clause itself and hence appears immediately after DISC, but it can also be embedded arbitrarily deep inside the token in long-distance dependencies.
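The co-indexing between DISC and the empty function can be checked mechanically. The sketch below is an illustration only: it assumes a hypothetical Node class on which a TAG marker is stored as an optional integer index, which is not necessarily how the corpus files encode it. It reports every index that appears on a DISC without a matching gap, or on a gap without a matching DISC.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Set


@dataclass
class Node:
    """Hypothetical tree node; tag holds the index of a TAG marker, if any."""
    label: str
    children: List["Node"] = field(default_factory=list)
    tag: Optional[int] = None


def walk(node: Node) -> Iterator[Node]:
    """Yield the node and all of its descendants, at any depth."""
    yield node
    for child in node.children:
        yield from walk(child)


def unmatched_tags(root: Node) -> Set[int]:
    """Return TAG indices that lack either a DISC filler or a co-indexed gap."""
    fillers = {n.tag for n in walk(root) if n.label == "DISC" and n.tag is not None}
    gaps = {n.tag for n in walk(root) if n.label != "DISC" and n.tag is not None}
    return fillers ^ gaps  # symmetric difference: present on one side only


if __name__ == "__main__":
    # Subject question: DISC carries TAG 1, the empty SBJ carries the same index.
    question = Node("MAT-QUE", [
        Node("DISC", [Node("who")], tag=1),   # the leaf stands in for the wh expression
        Node("SBJ", tag=1),
    ])
    print(unmatched_tags(question))
```

Because walk descends through the whole tree, a gap embedded deep inside the token in a long-distance dependency is found in the same way as a gap in the matrix clause.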

Comparison with other corpora The syntactic annotation for direct questions differs substantially from other corpora of the CorpusSearch family like the YCOE or the PPCME in that they do not include a finite subordinate clause, SUB, but are regarded as complete in themselves, with both a subject function and a head.


2.2. Clause-Level Functions

Clause-level functions are typically, though not necessarily, immediately dominated by top-level clauses, which is the reason for their name.
They can be divided into two classes. Firstly, they are dependents of clausal heads (e.g. an object in relation to a finite verb). That is to say, they are the core grammatical functions assumed in many theories of grammar, subjects, complements and adjuncts. Secondly, clause-level functions can introduce grammatical features into the syntactic structure (e.g. [voice = passive]).
The GeCeG distinguishes between five different kinds of clausal dependents: subjects (SBJ), direct complements (DIR), indirect complements (IDR), predicates (PRD) and clausal adjuncts (ADT). By the Annotate by Complexity Principle, these core functional labels are used for non-clausal categories while rightward extension labels indicate specific kinds of clausal functions.
The label for grammatical feature functions is an acronym formed from aspect, voice, modality (AVM). Heads of this function are usually formally realized by finite verbs, but are analyzed as lacking argument structure. The specific feature introduced by AVM is indicated with a rightward extension, e.g. by AVM-PASSIVE for a passive auxiliary etc. This is an exception to the Annotate by Complexity Principle because the extension does not indicate a clausal category here. Note that this annotation scheme does not necessarily reflect a sound linguistic analysis. It is probable that early German modal verbs or perfect auxiliaries are in fact full verbs with their own argument structure, selecting non-finite clauses as their complements. Nevertheless, the GeCeG treats markers of aspect, voice and modality as clause-level functions because this design facilitates searches for such constructions.
The graphic below summarizes these points.
[Figure: Top-Level categories]
In addition, the generic discourse function, discussed under discourse functions, can occur as a clause-level category.
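
The two classes of clause-level functions described above can be told apart by their core label. The following sketch is an illustration with hypothetical names (DEPENDENT_CORES, clause_level_class, avm_feature). Labels whose class the text does not state explicitly, such as NEGAT or NFMARK, are returned as "other" rather than guessed.

```python
from typing import Optional

# The five kinds of clausal dependents named in the text.
DEPENDENT_CORES = {
    "SBJ": "subject",
    "DIR": "direct complement",
    "IDR": "indirect complement",
    "PRD": "predicate",
    "ADT": "clausal adjunct",
}


def clause_level_class(label: str) -> str:
    """Classify a clause-level label as 'dependent', 'feature' or 'other'."""
    core = label.split("-")[0]
    if core in DEPENDENT_CORES:
        return "dependent"
    if core == "AVM":
        return "feature"
    return "other"


def avm_feature(label: str) -> Optional[str]:
    """Return the feature named by an AVM extension, e.g. 'passive'.

    For AVM the extension names a grammatical feature, not a clause type,
    which is the documented exception to the Annotate by Complexity Principle.
    """
    core, _, extension = label.partition("-")
    if core == "AVM" and extension:
        return extension.lower()
    return None


if __name__ == "__main__":
    for label in ("SBJ", "DIR-NFN", "AVM-PASSIVE", "NEGAT"):
        print(label, clause_level_class(label), avm_feature(label))
```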

Clause-Level Functions Overview List

The GeCeG uses the following 22 clause-level functions.

Clause-Level Label  Mnemonic
SBJ                 subject
SBJ-DCL             declarative subject clause
DIR                 direct complement
DIR-DCL             declarative direct complement clause
DIR-QUE             interrogative direct complement clause
DIR-FRL             free relative direct complement clause
DIR-NFN             non-finite direct complement clause
DIR-SPR             secondary predicate direct complement clause
IDR                 indirect complement
IDR-FRL             free relative indirect complement clause
ADT                 adjunct
ADT-ADV             adverbial adjunct clause
ADT-FRL             free relative adjunct clause
ADT-NFN             non-finite adjunct clause
ADT-NFN-PRP         purposive non-finite adjunct clause
ADT-SPR             secondary predicate adjunct clause
PRD                 predicate
AVM-PASSIVE         passive auxiliary
AVM-PERFECT         perfect auxiliary
AVM-MODAL           modal auxiliary
NEGAT               negation
NFMARK              non-finite marker

Clause-Level Function Details

ADT (adjunct)

Clause-level adjuncts are labelled ADT. They specify additional, optional information in relation to the predicator, for instance information about place and time of an event. By the Annotate by Complexity Principle, only adjuncts that do not embed a top-level category are annotated as ADT while more complex adjuncts take a rightward extension label. Thus, ADT is most commonly headed by a preposition or adverb (but not by a complementizer). These two categories are annotated as adjuncts by default, but see direct complements and indirect complements for cases of complement prepositional and adverbial phrases. Other word classes, for example nouns, can project adjunct functions as well.
Adjuncts do not take extensions to indicate their semantic types, like temporal, locative etc. adjuncts. Instead, such semantic information is conceptualized as an inherently lexical feature of certain word classes, like adverbs and prepositions, and thus represented as a subcategory type on part-of-speech base labels.

Comparison with other corpora Unlike other corpora like the YCOE or PPCME, the GeCeG does not use functional extension labels for semantic information like -TMP (temporal). Hence, a PPCME category like ADVP-TMP for 'temporal adverb phrase' would simply be encoded as ADT 'adjunct' (on the clausal level) in the GeCeG. Temporal information can be retrieved from the part of speech labels themselves.

Adjuncts are not used in appositive structures. Instead, adjuncts can be realized multiple times even if one quite clearly specifies more closely the semantic role of another. The reason for this annotation design is that a more fine-grained indication of the relation between similar adjuncts would frequently not be possible in an unambiguous and principled way.


ADT-ADV (adverbial adjunct clause)

Adverbial clauses are adjuncts on the clausal level. Their core functional label is therefore ADT. Like all finite clauses, they include a left periphery layer, which dominates a subordination function, a complementizer head, as well as a subordinate finite clause as its complement, SUB. By the Annotate by Complexity Principle, they thus receive a rightward extension label, -ADV, indicating the specific kind of embedded clause.
The semantic type of adverbial clauses is not indicated with an extension. However, subordinating conjunctions are subcategorized as temporal or locative wherever appropriate, exactly like adverb or prepositional heads.


DIR (direct complement)

Direct complements, labelled DIR, are functions that are directly selected by a predicator. The following syntactic environments require direct complements in the GeCeG:
  • (1) Ordinary transitive verbs select a DIR argument. In this context, they are typically noun phrases marked for accusative case, carrying the semantic role of theme. However, other cases and semantic roles can appear as well.
  • (2) Complements of prepositions are labelled DIR by default irrespective of phrasal category and case. Strictly speaking, DIR is not a clause-level function in this case because it is not immediately dominated by a clausal node. Nevertheless prepositional complements are annotated in this way because the preposition is assumed to be transparent to clause-level relations.
  • (3) Verbs of motion select prepositional phrase DIR arguments where the prepositional phrase expresses the goal of the movement. Examples are go to a place, run into a place etc. The preposition head is typed locative.
  • (4) Other prepositional phrases that are directly entailed by a selecting verb are labelled DIR, for example fight with/against someone.


DIR-DCL (declarative direct complement clause)

Finite complement clauses take the core label DIR as they function as direct complements of a clausal head with the rightward extension DCL, which indicates the type of embedded clause, namely one whose sentence type has declarative force. Like all finite clauses, DIR-DCL occurs with a left-periphery layer. It is usually minimal in that it only contains a complementizer head, COMP, and a subordinate finite clause, SUB. The single most common complementizer in early German is taz 'that.' Note that only that-clauses that are directly selected by a predicator are labelled DIR-DCL in the GeCeG, and not all functions that are formally alike in that they are headed by a complementizer that.

Comparison with other corpora The GeCeG function label DIR-DCL can be directly compared to CP-THT in other CorpusSearch corpora in the vast majority of instances. Some differences arise in the treatment of left-dislocated that clauses or associates of subject expletives.


DIR-NFN (non-finite direct complement clause)

Non-finite complement clauses are labelled by their core functional label, DIR, plus a rightward extension label indicating its non-finite form, -NFN. They are listed as clause-level elements because they are usually immediately dominated by clausal nodes, i.e. they are selected by clause-level heads, normally finite verbs. Non-finite complements are also top-level categories because they contain clause-level elements, at least a subject and a non-finite verbal head. Like all non-finite clauses, they do not contain a left-periphery layer.

Comparison with other corpora Subjects in non-finite clauses must always be indicated in the GeCeG according to the Top-Level Completeness Principle whereas other corpora of the CorpusSearch family, like the PPCME or YCOE, only indicate certain subjects, for example overt accusative subjects in AcI / ECM constructions.

  • Raising and Argument Control
In cases of raising and argument control, the subject of the non-finite clause is unified with its co-indexed overt grammatical function according to the usual guidelines regarding filler gap dependencies / movement.
Note that this annotation scheme does not offer a linguistically justified distinction between raising and argument control since it does not include an inventory of different kinds of movements and traces (for instance an A-movement trace vs. an empty pronominal category). The only difference between raising and control predicates is that the latter but not the former assign a thematic role to the subject. But the GeCeG does not attempt to annotate sentences for their theta relations. Consequently, the analyses for both constructions must be exactly identical. In this way, subjects are universally included in non-finite clauses in a principled way.

  • Arbitrary Subjects
Arbitrary subjects are not controlled by another grammatical function in the sentence, but are understood to be anyone or anything contextually relevant. Arbitrary subjects are indicated with an empty category, labelled *arb*.

  • Accusative Subjects
In accusativus-cum-infinitivo (AcI) constructions, there are overt, accusative subjects in a DIR-NFN function. AcI complements are limited to a handful of matrix predicates, like verbs of perception, causation and commanding. In other contexts, they may be over-literal translations from a Latin source.


IDR (indirect complement)

Indirect complements are labelled IDR. They are used in the following syntactic structures.
  • (1) Most commonly, indirect complements are the argument-structurally lowest functions of predicators in ditransitive constructions and realized as a noun phrase. They express event participants that are not in control of the action, but are affected by it. Their typical semantic roles are thus recipients, benefactives, malfactives, experiencers, addressees etc. In this syntactic context, they co-occur with argument-structurally higher direct complements, typically themes, or predicates. By default, benefactives that are ambiguous between adjunct and (applicative) complement status are labelled IDR (e.g. He baked him a cake = for him). Note that case is not a defining feature of indirect noun phrases – while they are typically marked for dative, they can also be realized by other cases like accusative.

(To be added later:
(2) PPs and other categories in ditransitives.
(3) Indirect complements are also found in transitive constructions:
(a) With impersonal verbs. The GeCeG has no quirky subjects; such cases are annotated as impersonal or expletive - associate structures instead.
(b) When the direct complement is implied.)

Comparison with other corpora Indirect complements are only partially comparable to NP-OB2 in the PPCME or NP-DAT in the YCOE. Indirect complements are not categorically restricted to noun phrases, but can be realized by other phrases as well, albeit rarely. Furthermore, they do not necessarily appear in dative case. Nevertheless, in most cases the label NP-OB2 covers instances that are essentially comparable to IDR in the GeCeG.


NFMARK (non-finite marker)

The function NFMARK hosts words of the category TO, i.e. non-finite markers like zu, ze, zi etc. Typically, the infinitive that is extended with NFMARK is inflected for dative case in early German.


PRD (predicate)

Predicates, annotated PRD, are categories that are immediately predicated of their subject or linked to them through a copula verb, like be, become, the early German verb heizzan 'be called' and others. Predicates are most commonly realized as noun phrases, adjective phrases, prepositional phrases or non-finite verbs. Noun phrase predicates are normally marked for nominative case.

Comparison with other corpora The GeCeG regards predicates as core clause-level functions. Hence, PRDs have the same ontological status as subjects, adjuncts etc. Other CorpusSearch corpora treat predicates differently. In the PPCME, they are seen as instances of direct objects, NP-OB1. The YCOE attaches a rightward extension, -PRD, to case marked NPs or ADJPs.


SBJ (subject)

Subjects are the most common clause-level function. They are annotated SBJ. In the GeCeG, all top-level categories (except for fragments) must contain a subject. This includes imperatives and non-finite clauses.
It is for the most part unproblematic to recognize subjects in early German. Subjects are the argument-structurally highest function of predicators. Furthermore, there are certain morpho-syntactic properties that regularly coincide in subjects: nominal subjects bear nominative case in finite clauses and accusative case in small clauses (including ECM / AcI constructions). Furthermore, finite verbs agree with their subject in person and number, even though the specific features may depend on the subject type, coordination context etc.


SBJ-DCL (declarative subject clause)

Finite that-clauses that function as subject are indicated as SBJ-DCL. Subject clauses agree with the finite verb in third person singular even when conjoined. A pre-verbal position is not a relevant criterion for the detection of subject clauses. Rather, whenever a subject interpretation of a that-clause is conceivable, it is annotated accordingly. In particular, that means that that-clauses are not interpreted as associates of empty expletives, but as subjects in their own right. In this way, the introduction of empty categories is minimized. Like all finite clauses, SBJ-DCLs include a left-periphery layer.

Comparison with other corpora In the GeCeG, clauses are annotated by function. In contrast, the YCOE or PPCME annotate that-clauses by form, namely as CP-THT in all uses. The GeCeG interprets that-clauses as subjects wherever possible irrespective of their position. The other CorpusSearch corpora do not force a particular interpretation on that-clauses, but they are normally either complements or function as associates of subjects. If there is no overt subject, an empty subject expletive is introduced.


2.3. Nominal-Level Functions

Nominal-level functions are things like MOD, DEF and QUANT.

Nominal-Level Functions Overview List

The GeCeG uses the following 10 nominal-level functions.

Nominal-Level Label  Mnemonic
DEF                  definiteness
QUANT                quantification
POSS                 possession
MOD                  modifier
MOD-REL              relative modifier clause
MOD-NFN              non-finite modifier clause
SBCTV                subjective genitive
OBCTV                objective genitive
APP                  appositive
FOREIGN              foreign language

Nominal-Level Function Details

APP (appositive)

The appositive function, APP, is used for replications of a clause-level or nominal-level function embedded within a mother function of the same type. That is to say, both the mother function and the appositive fulfill the same grammatical role in relation to the local predicator. The head that appears first in linear order is annotated as a direct daughter of the mother function; the subsequent element heads the appositive.
Appositives receive as their rightward extension the entire label of the function that they are embedded in. For example, APP-SBJ designates a phrase that is appositive on the head of a subject; APP-POSS stands for a phrase appositive on a possessor etc. In addition, appositives are subject to the Annotate by Complexity Principle so that they are additionally annotated for types of clauses they may dominate. For example, APP-DIR-DCL stands for a declarative clause that is appositive on a direct complement etc.
When appositives are extraposed, they form a gap and contain only a numerically indexed TAG marker, as required by the guidelines on filler-gap / movement dependencies. The overt material of the appositive is placed in a generic discourse function, DISC, which is positioned as closely as possible to its extraction site in the syntactic tree. The appositive gap is placed immediately after its head.
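The way appositive labels are composed can be sketched as a small helper. The function name appositive_label and its parameters are hypothetical; the sketch only reproduces the pattern of the three examples given above (APP-SBJ, APP-POSS, APP-DIR-DCL) and assumes that the clause type, where present, follows the label of the mother function.

```python
from typing import Optional


def appositive_label(mother_label: str, clause_type: Optional[str] = None) -> str:
    """Build an appositive label from the mother function's label.

    mother_label: the entire label of the function the appositive is embedded in.
    clause_type:  optional extension for a clause dominated by the appositive,
                  added in line with the Annotate by Complexity Principle.
    """
    parts = ["APP", mother_label]
    if clause_type:
        parts.append(clause_type)
    return "-".join(parts)


if __name__ == "__main__":
    print(appositive_label("SBJ"))          # phrase appositive on the head of a subject
    print(appositive_label("POSS"))         # phrase appositive on a possessor
    print(appositive_label("DIR", "DCL"))   # declarative clause appositive on a direct complement
```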

Comparison with other corpora Other CorpusSearch corpora like the PPCME and YCOE use the label PRN 'parenthetical' for functions that correspond to APPs in the GeCeG. However, the other corpora annotate a wide range of other constructions with PRN as well, like genuine parentheticals (functions that take scope over entire utterances, e.g. asides and clause-internal quotatives), right-node raising, bare reason adjuncts etc. The GeCeG annotation scheme for APPs is more restrictive. Furthermore, the PPCME and YCOE always use PRN as an extension to the right of some other base label. In the GeCeG, APP occurs as the core label.

Common uses of appositives include the following: Firstly, a non-clausal appositive may give specific detail on a nominal head, like names, titles, epithets, social roles etc. Secondly, expletives with clausal associates (e.g. it is nice that you are here, it pleases him to see you etc.) are annotated as extraposed clauses appositive on light elements like pronouns or demonstratives. For more information on clauses that do not occur as appositives on dummy subjects, see subject clauses.

Comparison with other corpora Expletive - clausal associate constructions are annotated differently in the GeCeG and other CorpusSearch corpora like the YCOE and PPCME. In the latter corpora, expletives like it (PRO) head a subject function that is coindexed directly with its associate clause. There is no trace / extraposition. Where there is only a clause but no overt dummy subject, an empty expletive (*exp*) is introduced in a subject function. For other overt place-holders, like that but also other elements, a clause is annotated as an extraposed appositive (e.g. CP-THT-PRN-1 etc.) with an *ICH* trace in the subject. In contrast, the GeCeG parses clauses with overt dummy subjects of any kind (it, that etc.) as appositives and clauses without an overt dummy subject by their function directly without the introduction of empty expletives.


DEF (definiteness)

Definiteness, labelled DEF, describes the degree of identifiability of a potential nominal head referent. In early German, identifiability is almost always definite, i.e. a specific referent is conceptualized as known by both speaker / author and addressee / reader in a given situation. Heads of DEF are therefore usually simple demonstratives, DS, or complex demonstratives, DD. Later, the word one, ONE, is increasingly used as an indefinite article, i.e. for situations in which a non-specific or generic referent is conceptualized as unidentified by both speaker / author and addressee / reader. It is not always easy to distinguish between the quantificational use of one and its use as an indefinite article. Researchers interested in this topic should search for both functions.


FOREIGN (foreign language)

The function FOREIGN always contains a sequence of foreign words. The sequence must be at least three words long without intervening native material. Foreign utterances that are one or two words long are always functionally analyzed (e.g. as modifier and nominal head etc.). The specific source language of the non-native material (Latin, French etc.) is not indicated on the functional label. However, non-native material in early German is virtually always Latin.
The syntax of the foreign words contained inside FOREIGN is not analyzed further, but left flat. Hence, sequences of foreign words are an exception to the Headedness Principle: all foreign words are immediately dominated by their mother function; they all appear to be heads of FOREIGN.
Furthermore, the function FOREIGN must always be the single child of a clausal or nominal function, like adjunct, modifier, subject etc. It cannot function as a root node or appear with another sister function or head. This, too, is a violation of the Headedness Principle: A clausal category can dominate only the function FOREIGN and then not dominate any head at all. By default, foreign titles are interpreted as subjects.
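The length requirement for FOREIGN can be expressed as a search for uninterrupted runs of foreign words. The sketch below is only an illustration: it assumes that each word of a token has already been marked as foreign or native (a hypothetical list of booleans), and the names MIN_FOREIGN_RUN and foreign_spans are not part of any GeCeG tooling.

```python
from typing import List, Optional, Tuple

MIN_FOREIGN_RUN = 3  # a FOREIGN sequence must be at least three words long


def foreign_spans(is_foreign: List[bool]) -> List[Tuple[int, int]]:
    """Return (start, end) word indices, end exclusive, of runs that qualify for FOREIGN.

    Runs of one or two foreign words are skipped because such short
    utterances are functionally analyzed instead.
    """
    spans: List[Tuple[int, int]] = []
    start: Optional[int] = None
    # A trailing False acts as a sentinel that closes a run at the end of the token.
    for i, flag in enumerate(is_foreign + [False]):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start >= MIN_FOREIGN_RUN:
                spans.append((start, i))
            start = None
    return spans


if __name__ == "__main__":
    # native, three foreign words, native, two foreign words
    flags = [False, True, True, True, False, True, True]
    print(foreign_spans(flags))
```

Any intervening native word resets the run, which mirrors the requirement that the sequence contain no native material.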

Comparison with other corpora Foreign language citations are not differentiated in the GeCeG. In contrast, the PPCME, for example, makes a difference between French, Latin and other sources. In general, the syntactic environment of foreign material is stricter in the GeCeG than other CorpusSearch corpora. The function FOREIGN must be the sole daughter of a clausal or nominal function. In contrast, the PPCME is much more liberal and allows LATIN or FRENCH etc. as root nodes, inside noun phrases, on the clausal level, in quotation phrases and in other syntactic contexts.


MOD (modifier)

Modifiers provide additional information on non-clausal heads. Modification is a fairly versatile category with applications in various domains. Modifiers have their own label, MOD, to distinguish them from clausal adjuncts, ADT. Prototypical modifiers are descriptions of nominal heads, most commonly with adjectives, prepositional phrases or relative clauses. The modification function is also used for modifiers in potential compounds whose component words are spelled separately in the text edition. Genitive phrases are never annotated as modifiers. Less common cases of modifiers include, among others, modification of adjectives, for example with degree adverbs, and formation of complex subordinators, for example with the adverb so.


MOD-NFN (non-finite modifier clause)

Non-finite modifier clauses are labelled by their core functional label, MOD, plus a rightward extension label indicating its non-finite form, -NFN. They are listed as nominal-level elements because they are contained inside clause-level functions, like subjects, complements or adjuncts, and relate to nominal heads, typically common or proper nouns. Non-finite modifiers are also top-level categories because they contain clause-level elements, at least a subject and a non-finite verbal head. Like all non-finite clauses, they do not contain a left-periphery layer. The subject of non-finite modifier clauses is virtually always controlled by the nominal head it modifies. Accordingly, the subject of the non-finite clause and the nominal head are coindexed with a unification marker in accordance with the guidelines regarding filler gap dependencies / movement.

Comparison with other corpora The GeCeG label MOD-NFN is often analogous to PPCME reduced relative clauses, RRC, or to YCOE participial phrases PTP. While English reduced relative clauses are virtually always positioned immediately after the nominal head they modify, German non-finite modifier clauses can also occur as complex pre-nominal attributes. Furthermore, RRCs and PTPs never include a subject function while non-finite modifier clauses in the GeCeG always include a subject in accordance with the Top-Level Completeness Principle.


MOD-REL (relative modifier clause)

The label MOD-REL is used for finite relative clauses modifying a nominal head. Like all finite clauses, they have a left-periphery layer as described under Periphery Functions. The periphery is headed by a complementizer, which takes a subordinate clause complement. In addition, the periphery includes a relative topic, which is most commonly headed by a simple demonstrative functioning as a relative pronoun in early German. It can, however, also be empty. Conversely, the complementizer is normally empty, but can also sometimes be overt. The relative topic is annotated with the generic discourse function, DISC. The subordinate clause includes the relativized grammatical function, which is most commonly empty but can also include a resumptive element. The relative topic and the relativized element are unified through the use of co-indexed TAG markers. The relativized element can be immediately dominated by the subordinate clause but it can also be embedded more deeply in the syntactic structure. The head noun and the relative topic are assumed to have an anaphoric relation (operator movement analysis of relative clauses). Their coreference can be unambiguously inferred from the structural fact that relative clauses are always sisters of their antecedent and is therefore not explicitly encoded, for example through some co-indexing mechanism. The GeCeG does not make a distinction between restrictive and non-restrictive relative clauses.

Comparison with other corpora Relative clauses are handled essentially in the same way in the GeCeG and in the PPCME or the YCOE. The only differences come from the way movement is encoded in general in the different corpora, from the form of empty complementizers and from the label of the relative topic, i.e. DISC in the GeCeG and WNP-#, WPP-# etc. in the other corpora.


OBCTV (objective genitive)

The label OBCTV stands for 'objective genitive.' The function is used for genitive noun phrase complements of nominal heads. It is roughly the nominal level equivalent of clause-level DIR. Hence, the label is used whenever a verb – object paraphrase of the head noun – genitive phrase constituent is possible. OBCTV is the default annotation for genitive phrases. The applicability of OBCTV is tested for before the applicability of any other possible genitive labels is considered.


POSS (possession)

The possession function, POSS, indicates a possessor in relation to a possessed nominal head.
Possessive pronouns project POSS by default. They are heads of other functions only rarely.
Genitive phrases are not commonly possessors. Rather, they are annotated as objective genitives by default and secondly, if an objective reading is impossible, as subjective genitives. Only when these two functions are not plausible will genitive phrases be annotated as POSS. In these cases, they are often possessors only in a metaphorically extended sense.


QUANT (quantification)

The quantification function, QUANT, indicates the quantity of the nominal head it relates to. It answers the question, "for the nominal head x, how many x?" The function can be headed by numerals, the word one or, most commonly, by quantifier adjectives.


SBCTV (subjective genitive)

The label SBCTV stands for 'subjective genitive.' The function is used for genitive noun phrases that relate to nominal heads roughly in the same way that subject functions relate to clause-level heads. Thus, the label is used if a subject – verb paraphrase of the head noun – genitive phrase constituent is possible. It is not, however, a default option. First, an objective genitive reading is tested for (see OBCTV). If an objective genitive annotation is not a viable choice, the applicability of a subjective interpretation of the genitive is attempted only in a second step.
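
The order in which the genitive labels OBCTV, SBCTV and POSS are tested can be summarized as a small decision cascade. The following Python sketch is only an illustration of that order: the function name and its two boolean arguments are hypothetical and stand for the annotator's judgement about whether the respective paraphrase is possible. Any genitive labels not described in this section are ignored.

```python
def genitive_label(objective_ok: bool, subjective_ok: bool) -> str:
    """Return a label for a genitive phrase, following the documented test order.

    objective_ok:  a verb - object paraphrase is judged possible (hypothetical input)
    subjective_ok: a subject - verb paraphrase is judged possible (hypothetical input)
    """
    if objective_ok:
        # OBCTV is the default and is always tested first.
        return "OBCTV"
    if subjective_ok:
        # SBCTV is only attempted in a second step.
        return "SBCTV"
    # Only when neither reading is plausible is the phrase annotated as POSS.
    return "POSS"


print(genitive_label(True, True))
print(genitive_label(False, True))
print(genitive_label(False, False))
```

Note that a phrase for which both paraphrases are conceivable still receives OBCTV, because the objective reading is tested first.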


2.4. Periphery Functions

The GeCeG includes a hierarchically distinct, structural clause embedding domain, called left periphery, for all embedded finite clauses. It is sandwiched between the mother node of adjunct, complement or modifier finite clauses and the clausal material of the actual finite subordinate clause. The left-periphery layer corresponds roughly to the projections ranging from embedded CP to IP in the terminology of mainstream generative syntax. The left periphery is always headed by an overt or empty complementizer, COMP, and always contains a finite subordinate clause, SUB, as its complement to the right. There are two further, optional periphery functions, which can precede the complementizer head: a generic discourse category, DISC, and subordination, SBORD. This is illustrated in the graphic below:
[Figure: Left periphery categories]
A separate periphery layer is never included for matrix clauses.
Likewise, a separate periphery layer is never included for non-finite clauses.
The following functions always contain a left periphery: DIR-DCL, SBJ-DCL, ADT-ADV, MOD-REL, DISC-REL. See the respective functions for more details and examples.
DISC can occur as a left-periphery function, but also in other contexts, and is therefore discussed in its own section. See generic discourse function for details.
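
The ordering constraints on the left periphery can be made concrete with a small check. The Python sketch below assumes a simplified representation in which a left periphery is given as a left-to-right list of its labels; the function name is hypothetical, and the restriction that DISC and SBORD each occur at most once is an assumption made for the sake of the example, not a rule stated in this manual.

```python
OPTIONAL_BEFORE_COMP = {"DISC", "SBORD"}


def is_valid_left_periphery(labels):
    """Check a left-to-right list of labels against the periphery layout.

    Required: exactly one COMP (the head) followed by exactly one SUB.
    Optional: DISC and SBORD, which may only precede COMP.
    """
    if labels.count("COMP") != 1 or labels.count("SUB") != 1:
        return False
    comp = labels.index("COMP")
    before, after = labels[:comp], labels[comp + 1:]
    # Only the optional periphery functions may precede the complementizer.
    if any(label not in OPTIONAL_BEFORE_COMP for label in before):
        return False
    # Assumption for this sketch: each optional function occurs at most once.
    if len(set(before)) != len(before):
        return False
    # SUB is the complement to the right of COMP; nothing else follows.
    return after == ["SUB"]


print(is_valid_left_periphery(["SBORD", "COMP", "SUB"]))
print(is_valid_left_periphery(["SUB", "COMP"]))
```

The first call describes a typical adverbial clause periphery and is accepted; the second is rejected because SUB precedes its complementizer.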

Comparison with other corpora The GeCeG left periphery can be directly compared to CPs in other CorpusSearch corpora like the YCOE or PPCME. However, matrix clauses, like direct questions, never include a left periphery in the GeCeG but may do so in the other corpora.

Periphery Functions Overview List

The GeCeG uses the following 2 periphery functions.

Periphery Label Mnemonic
SBORD subordination
SUB finite subordinate clause

Periphery Function Details

SBORD (subordination)

The subordination function, labelled SBORD, introduces finite subordinate clauses, in particular adverbial clauses, ADT-ADV. Where present, it usually appears in first position on the left-periphery level. The subordination function is always headed by a subordinating conjunction, SBNJ. The subordinator is usually a simplex form (e.g. when, as, though) but can also occur with other modifiers like so, creating complex subordinators. Specifically, where Schützeichel (2006) lists combinations of a head subordinator with a modifier, they are bracketed as one SBORD constituent. When a subordinating function is present, COMP is usually empty. The subordination function is required since by the Headedness Principle, subordinating conjunctions cannot themselves be heads of the embedded left periphery, but must have a grammatical relation to complementizers.

Comparison with other corpora There is no subordination function in other CorpusSearch corpora, like the PPCME or YCOE. The PPCME treats subordinators as prepositions, which take adverbial clauses as complements. The YCOE is more similar to the GeCeG in that it places subordinators, likewise annotated as P, directly in front of complementizers. The GeCeG has a designated part-of-speech label for subordinators and includes subordination functional brackets around them. The three annotation schemes are illustrated below for the adverbial clause when he came, with the finite subordinate clause he came left unanalyzed.
[Figure: Subordination Differences Example]


SUB (finite subordinate clause)

Finite subordinate clauses are complete top-level categories as they immediately dominate clause-level functions and at least a subject and a head. However, they are never root nodes. Instead, they occur exclusively as a left-periphery function, i.e. they are always embedded inside finite complement, modifier or adjunct clause functions. They are always the complement of an overt or empty complementizer, annotated COMP, which heads the embedded left-periphery layer.


2.5. Coordination Function

There are many different kinds of coordination structures: there can be only one, or an arbitrarily large number of conjuncts; coordination relations may hold between heads, phrases, embedded or matrix clauses; clausal conjuncts may be complete or incomplete requiring reconstruction from a previous clause; conjunction heads can be overt or empty. The GeCeG includes only one, global coordination function for all these and other cases of coordination without any exceptions, labelled COORD.

Comparison with other corpora The treatment of coordination structures in the GeCeG is fundamentally different from annotation schemes of coordination in other corpora of the CorpusSearch family. Users familiar with, for example, the PPCME or YCOE should pay close attention to the relevant differences to make sure they find all structures they are interested in.

The Global Coordination Scheme

The GeCeG uses only one coordination structure for all cases of coordination. This universal coordination scheme works as follows: a mother function of a particular type immediately dominates its head as well as any number of other functions in relation to that head. Coordination, COORD, is just like any other function in that respect, i.e. conjuncts are sisters of the mother node’s head as well. The coordination structure must necessarily be headed. Heads of COORD are always conjunctions, CONJ, and conversely, all CONJs project a COORD function. In cases where a coordination structure is needed but no overt conjunction head is available, an empty conjunction is introduced as a head. The complement of CONJ is another function, which, crucially, is of the same type as the higher mother node dominating the conjunct.
The graphic below illustrates this coordination scheme. f1 to fk stand for typed functions, h1-3 are the heads of the mother, the coordination structure and its complement respectively. Note that the mother function and the conjunct function are essentially of the same type (in red).
[Figure: Coordination Scheme]
Extraposed coordination functions work in the same way except that they are formally generic discourse functions associated with an empty coordination function in its place of interpretation through the use of co-indexed unification tags. See the guidelines on filler-gap / movement structures and DISC for general information on extraposition or extraposed coordination structures for more details and examples.

Comparison with other corpora The global coordination scheme in the GeCeG works differently from the general coordination guidelines of other CorpusSearch corpora. Firstly, word-level conjunction is annotated flat in the major CorpusSearch corpora but falls under the general coordination scheme in the GeCeG. Secondly, for phrase-level conjunction, the other CorpusSearch corpora repeat the mother label in cases where it contains a coordinate structure in an attempt to imitate phrase-level adjunction. The repeated label contains the first conjunct. The second and following conjuncts are labelled CONJPs and they are structurally sisters of the first conjunct. Their heads are conjunctions if they are overt. In contrast, the GeCeG does not repeat the mother label but, again, uses the global coordination scheme. Hence, phrasal conjuncts are modelled as a function immediately dominated by the mother node. The difference between the two annotation schemes is illustrated below for the direct object my mother and her dog.
[Figure: Coordination Differences Examples]
Phrasal conjunction may be less adequate from the point of view of linguistic theory in the GeCeG, but it makes controlling and searching for coordinate structures considerably easier with the functionality provided by the CorpusSearch program.
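
The global coordination scheme can also be stated as a structural check. The Python sketch below represents a tree as nested tuples and verifies that every COORD has exactly one CONJ head and exactly one complement function of the same type as the mother of the COORD. The tuple representation, the function names and the part-of-speech labels PRO$ and N are assumptions made for illustration only; the tree for my mother and her dog is a plausible rendering of the scheme, not a quotation from the corpus.

```python
def is_leaf(node):
    """A leaf is a (part-of-speech, word) pair; functions hold further tuples."""
    return len(node) == 2 and isinstance(node[1], str)


def check_coord(node):
    """Return True if every COORD under `node` follows the global scheme."""
    if is_leaf(node):
        return True
    label, children = node[0], node[1:]
    for child in children:
        if child[0] == "COORD":
            heads = [c for c in child[1:] if is_leaf(c) and c[0] == "CONJ"]
            conjuncts = [c for c in child[1:] if not is_leaf(c)]
            # One CONJ head, and one complement typed like the mother function.
            if len(heads) != 1 or [c[0] for c in conjuncts] != [label]:
                return False
        if not check_coord(child):
            return False
    return True


# Hypothetical GeCeG-style tree for the direct object "my mother and her dog".
direct_object = (
    "DIR",
    ("POSS", ("PRO$", "my")),
    ("N", "mother"),
    ("COORD",
        ("CONJ", "and"),
        ("DIR", ("POSS", ("PRO$", "her")), ("N", "dog"))),
)

print(check_coord(direct_object))
```

Because the conjunct repeats the type of its mother, a search for coordinated direct objects only needs to look for a DIR that immediately dominates a COORD.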


Subjects in Conjoined Non-finite Clauses

The indication of subjects is identical in non-conjoined and coordinated non-finite clauses. For details on non-conjoined non-finite clauses, see non-finite complements, non-finite modifiers and non-finite adjuncts. In cases of coordinated non-finite clauses, the overt higher subject is unified with raised and controlled empty subjects of all non-finite conjuncts with one numerical index in accordance with the usual guidelines regarding filler-gap dependencies / movement.


Missing Subjects in Conjoined Clauses


2.6. Discourse Functions

The GeCeG includes generic discourse functions, whose core label is DISC.
DISC is used whenever a syntactic category functionally belongs to one constituent, but is positioned somewhere else in the syntactic tree where it does not at the same time relate as a grammatical function to the local predicator. For example, a direct object may belong to a lower predicate, but appear in a higher clause, where it is not interpreted as a subject, object, adjunct etc. In derivational terms, DISC designates the target of an A-bar moved grammatical function.
In accordance with the guidelines regarding filler-gap dependencies / movement, the overt material in DISC is associated with an empty (or rarely resumptive) grammatical function through the use of co-indexed unification markers. The unification marker always occurs as the first element of the generic discourse function before the dislocated overt material. That is sketched in the graphic below:
[Figure: Order in DISC]
The name of the function was chosen for the following reason: Constituent displacement of the kind indicated with DISC presumably happens for extra-grammatical or non-formal reasons, like the organization of information, style, parsing efficiency, discourse-optimization etc. Hence, in most cases, DISC can be regarded as some information-structurally designated position, like focus or topic. However, since labels like "focus" and "topic" are neither free of theoretical presuppositions nor used consistently across different studies and schools, nor easy to annotate, nor sufficiently fine-grained, the GeCeG uses the broader and more theory-neutral term "generic discourse function" instead.
DISC is indicated only if displacement of a syntactic function is unambiguous. That means that at least one element must intervene between the displaced category and the domain that it belongs to functionally. In addition, all wh-constituents in direct questions are annotated as DISC even for cases of local extraction without an intervener between the wh-element and its associated gap.
DISC is subject to the Annotate by Complexity Principle. Hence, it can occur with various rightward extensions if it contains clausal material. In addition, displaced coordination functions receive a rightward extension. The permissible types of DISC are listed in detail below.
DISC is always attached as closely as possible to its extraction site. That means that, ideally, the DISC filler and its gap should occur in the same function, and wherever this is not feasible, the number of intervening functions should be minimized.

Comparison with other corpora The GeCeG employs only one label for all cases of unification of a grammatical (e.g. object) with a non-grammatical (e.g. topic) category, the generic discourse function, DISC. Other CorpusSearch corpora, like the YCOE or PPCME, use much more intricate mechanisms to indicate various kinds of A-bar movement. Specifically, DISC can be compared to the following constructions in the YCOE or PPCME:
(1) Generally, the YCOE or PPCME use a numerical index on the outer syntactic bracket of the displaced constituent and a co-indexed trace inside the empty function. For wh-moved elements, the syntactic bracket indicates the form of the moved constituent, e.g. WNP-index for wh-noun phrase etc., and the trace is *T*. The form information can be retrieved in the GeCeG by looking at DISC functions dominating wh-elements.
(2) For other kinds of A-bar movement, i.e. all cases of fronting and extraposition, some syntactic brackets are stripped of their extensions, e.g. NP-index, for A-bar moved direct objects, otherwise indicated as NP-OB1, and the trace is *ICH* ("interpret constituent here"). (Incidentally, this design choice sometimes results in unrelated structures with very similar annotations. For example NP-1 could be a topicalized direct object, but also the associate of a there expletive. It is therefore relatively complicated to write search queries where all and only A-moved topics are ignored, for example.) In the GeCeG, a researcher interested in a particular A-bar moved category should look not for the category of interest with a numerical index, but rather for DISC functions whose unification marker is co-indexed with the relevant category.
(3) Other A-bar moved categories are annotated with their syntactic brackets fully intact and a numerical index associated with an identical index on a *ICH* trace. For example, an A-bar moved that-clause would be indicated as CP-THT-index. In the GeCeG, such structures can be searched for just like any other instances of the generic discourse function. However, for convenience, clausal DISCs additionally indicate their specific type with rightward extensions. Hence, a topicalized declarative that clause would be annotated as DISC-DCL in the GeCeG.
(4) The YCOE and PPCME indicate left-dislocation with a resumptive element with the extension pair -LFD and -RSP respectively. In contrast, the GeCeG uses DISC for dislocated elements with a resumptive element as well. The only difference is that the grammatical function that DISC is associated with dominates material other than just the unification marker.
The intricate annotation system for A-bar movement in the YCOE and PPCME takes a toll on their annotation accuracy. For some examples, click here. The GeCeG system leads to fewer inconsistencies.

List of Uses of the Generic Discourse Function

DISC can be a clause-level function when it occurs inside a top-level category. This happens in the following cases:

  • (1) DISC is used for all instances of unambiguous unification between an empty grammatical function of a lower top-level function / clause and its overt realization in a structurally higher top-level function / clause, where the displaced constituent does not fulfill a grammatical function in the higher clause at the same time. In derivational terms, these are cases of long-distance A-bar movement, like clause-crossing topicalization, contrastive focusing, scrambling etc. Frequently, DISC will occur clause-initially in these cases. However, it is also possible for a dislocated element to be realized elsewhere within a clause.

  • Examples

  • (2) DISC is used if a displaced unit is unified with a grammatical function even in the same top-level function / clause if its displacement is unambiguous. Then, DISC's closest position to its gap may be immediately under a top-level category and thus counts as a clause-level category. In derivational terms, these are instances where DISC functions as the target of local A-bar movement. Examples are stranding of elements that select the displaced constituent as their complement or extraposition of nominal-level categories like nominal-level coordination structures, relative modifiers etc. within a clause.

  • Examples

  • (3) DISC designates focused wh-constituents in direct questions. This is the only case which may not unambiguously involve displacement since nothing may intervene between DISC and its associated gap in cases of local extraction. For details and examples, see direct questions.
( later: (4) left-dislocation; My brother, he is always sleeping; triangle, no special label in GeCeG, case of topicalisation; )

DISC can be a nominal-level function when it is embedded inside a clause-level category. This is the case in the following contexts:

  • (5) DISC contains the material of a dislocated nominal-level function, like extraposed appositives, displaced coordination structures etc. However, if only a local element intervenes, DISC's closest position to its gap may be under the immediate dominance of the same mother function. For example, both DISC and its gap may occur under the same direct object, adjunct etc. In this case, DISC itself will be a nominal-level function. In derivational terms, these are cases of A-bar movement that unambiguously occurs within a nominal phrase. For more details and examples, see appositives.

(later: (6) displaced coordinations)

DISC can be a left-periphery domain function. In the parlance of mainstream generative syntax, this translates to the specifier position of embedded CPs. See periphery functions for details. It is found in this position in the following cases:

  • (7) DISC designates relative topics, usually headed by relative pronouns, although they may also be empty. For details and examples, see relative clauses.

(later: (8) embedded questions)

Discourse Functions Overview List

The GeCeG uses the following 4 generic discourse functions.

Discourse Label Mnemonic
DISC generic discourse function
DISC-COORD extraposed coordination function
DISC-REL extraposed relative clause
DISC-DCL displaced subordinate declarative clause

Discourse Function Details

DISC (generic discourse function)

DISC represents non-clausal functions that are unified with a grammatical function but not themselves a grammatical function in relation to the local predicator. Instead, material is positioned in DISC for discourse-structural reasons. See the general discussion above for more information.


DISC-COORD (extraposed coordination function)

DISC-COORD is the label used for extraposed coordination functions. The conjoined material inside the generic discourse function obeys all the constraints imposed on coordination functions in general. The rightward extension -COORD is actually redundant as it can be directly inferred from the mother function of its co-indexed unification tag. Nevertheless, the extension is included for convenient search query scripting. As with all other DISC functions, the implicit assumption is that its content has been displaced not for formal but for discourse-structural reasons. See the general discussion above for more details.


2.7. Disfluency Function

disfluencies are things like false starts, breaks; also: asides; which are parentheticals in other corpora, can be properly handled like this in the syntactic annotation

3. Filler-Gap / Movement

Languages display structures in which one element occurs on one hierarchical level, for instance a main clause, but is also interpreted in another domain, for instance an embedded clause. An example of this phenomenon is shown in the following sentence, in which the first bracketed constituent is also interpreted in all positions indicated with an underscore: [Which soldiers ] did the general convince _ [ _ should scrub the floor [ _ naked] [without warning _ [ that a TV company would film _ ]]]? Such cases of displacement are most commonly referred to as ‘filler-gap dependencies’ in constraint-based theories of syntax or as instances of ‘movement’ in transformational frameworks.
There are various dimensions along which filler-gap dependencies can be differentiated: The filler can be a discourse or grammatical function (A-bar vs. A movement in derivational accounts); the gap could be in the same, local domain as the filler or have a long-distance relation to it; there may or may not be a wh-item in the filler; the gap may be unpronounced, which is typical, but may also be realized by a light resumptive functional item, like a pronoun; the filler may have a grammatical relation to the predicators of both its extraction site and its local domain, or it may have a grammatical function only in its extraction site, but fulfill a discourse role in its displaced position (A vs. A-bar movement in derivational accounts); if a filler is associated with more than one gap, all gaps could appear in argument domains or at least one gap could appear in an adjunct domain (cyclic movement vs. operator movement with anaphoric co-indexing in derivational accounts) etc. The GeCeG uses only one, global annotation scheme for all instances of filler-gap dependencies / movement without exception.

Comparison with other corpora The treatment of filler-gap relations / movement in the GeCeG is fundamentally different from annotation schemes for such structures in other corpora of the CorpusSearch family. Users familiar with, for example, the PPCME or YCOE should pay close attention to the relevant differences to make sure they find all structures they are interested in.

The Global Filler-Gap / Movement Scheme

The co-dependency between a filler and its gap(s) is indicated by the use of so-called unification markers. They are formally identical to a bracketed part-of-speech – word form pair. The label of unification markers is TAG, which stands for ‘tag’ in the sense of ‘label, index, marker.’ The TAG marker is always extended by a numerical index, e.g. -1, -2 etc. It is the only instance of a label that receives a numerical index in the entire GeCeG. The form of the unification marker is 0. This is the only function of this symbol in the GeCeG. Both the filler and its gap(s) immediately dominate a unification marker. It always appears in first position. Normally, the gap does not contain any other material apart from the unification marker, but it can also rarely co-occur with a functional head, namely resumptive elements. All functions that are in a filler-gap relation / form a chain have the same numerical index on the unification marker. In this way, the unification markers capture the co-dependence between displaced / moved constituents.
The graphic below illustrates the global filler-gap / movement scheme. The function f1 designates the mother function, e.g. a clausal node, immediately dominating a filler (as well as a head and any number of other functions). It also dominates a gap, either immediately or mediated through some other function, f3 to fi. The gap may or may not contain an overt resumptive head, h3. Both the filler and the gap contain a unification marker in their first position with numerical indices a and b. Co-dependency between the two elements is achieved by setting a equal to b.
[Figure: Filler-Gap Dependencies]

Comparison with other corpora The global annotation scheme for filler-gap / movement dependencies in the GeCeG is quite different from the way such structures are handled in the other CorpusSearch corpora. Firstly, corpora like the PPCME or YCOE use different kinds of traces to imitate different kinds of movement standardly assumed in transformational syntax. For example, they mark A-movement other than ordinary passive as *-index, wh-movement as *T*-index and A-bar movement as *ICH*-index. In contrast, the GeCeG has only one form of the unification marker, 0, and all 0s in the corpus are used for this purpose. Secondly, the co-indexing mechanism in other CorpusSearch corpora is applied to pairs of traces that are immediately contained inside functional brackets and the functional brackets of the filler. The GeCeG on the other hand co-indexes only TAG labels. All syntactic brackets are kept free from numerical indices. Thirdly, corpora like the YCOE and PPCME use different indices for each member in a chain of gaps whereas the GeCeG uses identical indices for all co-dependent gaps. For example, a sentence like What was to be done? would have the following rudimentary structure in the former corpora: (WNP-1 (WPRO What)) ... (NP-SBJ-2 *T*-1) was ... (NP-SBJ *-2) to be done? with two different indices (in red and blue), but a structure like this in the GeCeG: (DISC (TAG-1 0) (WPRO What)) ... (SBJ (TAG-1 0)) was ... (SBJ (TAG-1 0)) to be done? with only one numerical index. Where a chain analysis is deemed unfeasible, operator movement is used instead in the other CorpusSearch corpora while the GeCeG uses the global filler-gap / movement scheme for such instances as well. Finally, resumptive elements are not subsumed under the co-indexing mechanism of the other CorpusSearch corpora but fall under the global filler-gap / movement scheme of the GeCeG.
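
Since all members of a chain share one TAG index, chains can be recovered from a bracketed parse with very little machinery. The following Python sketch groups the labels of all functions whose first element is a unification marker by their numerical index. It is an illustration only: the function name and the regular expression are assumptions, and the input is the rudimentary GeCeG structure for What was to be done? given above rather than a complete corpus token.

```python
import re
from collections import defaultdict

# A function label immediately followed by a unification marker, e.g. "(SBJ (TAG-1 0)".
TAG_FIRST = re.compile(r"\(([^\s()]+)\s+\(TAG-(\d+) 0\)")


def chains(parse: str) -> dict:
    """Group the labels of functions that start with a TAG marker by index."""
    result = defaultdict(list)
    for label, index in TAG_FIRST.findall(parse):
        result[int(index)].append(label)
    return dict(result)


example = (
    "(DISC (TAG-1 0) (WPRO What)) ... (SBJ (TAG-1 0)) was ... "
    "(SBJ (TAG-1 0)) to be done?"
)
print(chains(example))  # {1: ['DISC', 'SBJ', 'SBJ']}
```

The single entry shows the filler in DISC and both subject gaps as members of one chain, which is exactly what the shared index is meant to express.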

List of Filler-Gap / Movement Dependencies in the GeCeG

Filler-gap / movement dependencies are annotated ...

  • (1) ... between a relative pronoun topic and its function in the relative clause. For details and examples see relative clauses.
  • (2) ... between a focused wh-element and its lower grammatical function in direct questions. For details and examples, see interrogative matrix clauses.
  • (3) ... for all other cases of unification of a generic discourse function with another syntactic function / A-bar movement, like topicalization, extraposition etc. See discourse functions for details.
  • (4) ... between a structurally higher argument, often the subject, and a structurally lower, usually empty, controlled or raised grammatical function of a non-finite clause, also often the subject. For details and examples, see the non-finite clause functions, DIR-NFN, MOD-NFN, ADT-NFN, and coordinated non-finite clauses.
  • (5) ...

4. Empty Categories

all empty categories are enclosed *x*; list of all empty categories, and where to find them
some are heads, *cpz*, *cnj*; also 0 technically speaking, but not a linguistically real element, part of formalism;
others are not heads, but should be like a function; *imp*, *pro*, *exp* etc.. These are therefore embedded in a pos label, EC.

5. Footer

includes things inside wrapper parenthesis

5.1. Gloss

The footer contains an English gloss for every overt German word in the token. Glosses are put inside curly brackets introduced with the string GLOSS:. Their base label is CODE.
Glosses consist of rough English translations of every German word in linear order. The English words are separated with underscores. They are normally of the same category as the German source and preferably etymologically related to them if possible. Plural s, past ed, genitive ’s, s’ etc. are indicated where a similar feature is found in German. In general, however, inflectional features are not indicated in the gloss. In particular, the oblique cases are not glossed, for instance as a preposition, but translated in their English citation form. Inflections should be retrieved from the inflectional features on the base part-of-speech labels in the syntactic tree instead. The first word of the English gloss is capitalized as are all proper names. This convention is followed even if one might feel that the token is really an inherent part or direct continuation of the thought of the previous material. Hyphens are used to mark compounding, fusion in the German words (e.g. nechame glossed as not-came) or translations of one German word with multiple English words (e.g. hiez as was-called). The punctuation, commas, periods etc., in the English gloss mimics the convention used in the edition of the early German text. Punctuation marks are strung directly to their preceding word (e.g. A_B,_C) even if the German text edition shows that the punctuation mark is clearly separated with spaces to the right and to the left (e.g. A_B_,_C). Empty categories are not represented in the gloss but should be reconstructed from the information provided in the syntactic tree. Foreign material is translated into English in the gloss wherever appropriate.
Users of the GeCeG should keep in mind that the glosses are intended mainly as an aid to quickly comprehend and analyze a sentence. There may be better translation options; some of the glosses may be inaccurate. The glosses are not meant to be suggestions for citations of the German tokens, for example in academic papers.

Comparison with other corpora Glosses are not found in other CorpusSearch corpora like the YCOE or PPCME.

(1) (CODE {GLOSS:Holy_Paul_promised_those._who_in_his_times_expected_the_doomsday.

A typical English gloss. The gloss is labeled CODE and contained in curly brackets, which are introduced with GLOSS:. Underscores separate every English word.
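
The gloss conventions make it straightforward to recover the individual English words from a footer node. The Python sketch below is a minimal illustration under two assumptions: that a complete gloss node is closed with }) in the same way as the Latin node shown in section 5.2, and that punctuation attached to a word can simply be stripped. The function name and the sample node are hypothetical.

```python
import re

# Assumed shape of a complete gloss node: (CODE {GLOSS:word_word_...})
GLOSS = re.compile(r"\(CODE \{GLOSS:(.*?)\}\)")
PUNCTUATION = ".,;:!?"


def gloss_words(code_node: str) -> list:
    """Return the English gloss words of a CODE node, without punctuation."""
    match = GLOSS.search(code_node)
    if match is None:
        return []
    words = (word.strip(PUNCTUATION) for word in match.group(1).split("_"))
    # Hyphenated glosses such as not-came stay together as one word.
    return [word for word in words if word]


print(gloss_words("(CODE {GLOSS:He_not-came,_then.})"))  # ['He', 'not-came', 'then']
```

Keeping hyphenated items intact preserves the one-to-one correspondence between gloss words and overt German words.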

5.2. Latin

Many Old High German texts are directly dependent on a Latin source. The Latin material is often identifiable exactly as it appears in the original manuscripts along with the early German text. Modern standard editions therefore often print the Latin source as well as the early German translations. Wherever this is the case, the Latin original has been included in the footer of the token immediately after the English gloss. The Latin material is put into curly brackets, introduced by the string LATIN: and separated by underscores where the text edition shows spaces. Its part-of-speech label is CODE. The GeCeG text pages should be consulted to find out whether or not any given text includes Latin source material. The category "Latin" gives the relevant details and background information. Texts with a Latin source have a Latin CODE node for all sentences except for tokens that consist themselves exclusively of Latin words, like titles, citations, excipits etc. The Latin text is cut up or extended across several sentences to fit the German content as well as possible. German tokens that are not translated from the Latin as printed in the text edition show [no_direct_Latin_source] or some other comment in the footer. Every effort has been made to align the Latin and German material as closely as possible. However, some alignments may not be optimal and researchers might want to check the Latin of the preceding and following tokens of critical examples as well.

Comparison with other corpora Latin source material is not found in other CorpusSearch corpora like the YCOE or PPCME.

(1) (CODE {LATIN:Qui_peregi_quondam_carmina_

A typical Latin source. The Latin material is labeled CODE and contained in curly brackets, which are introduced with LATIN:. Underscores separate every element that is enclosed between spaces in the text edition used.

(2) (CODE {LATIN:[no_direct_Latin_source]})

A German token that is not directly dependent on the Latin source as printed in the text edition.

5.3. Identifier

Identifiers are found at the end of the footer of every token. They are labelled ID and allow cross-referencing the token with the text edition that the electronic parsed file is based on. All identifiers consist of a short title of the text followed by a comma and a sequence of numbers or other specifications indicating page, line, token number or some other indication of the position of the token in the text edition. For the specific short title and the meaning of the subsequent numbers, see the description of the texts used in the GeCeG under the rubric "ID."

(1) (ID BoethI,5.11.4)

This is a typical identifier. The short title BoethI stands for Boethius’ ‘De Consolatione’ Book 1. The three numbers represent page, line and token number respectively, i.e. this token comes from page five, line eleven and is the fourth item in the text file. These pieces of information can be found under Texts, in this specific case, under the rubric "ID" for Notker’s Translation of Boethius’ ‘De Consolatione Philosophiae’.
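
An identifier of this shape can be split into its short title and position information with a few lines of code. The Python sketch below is illustrative: the names Identifier and parse_id are hypothetical, and the assumption that the position parts are separated by periods holds for the BoethI example but may not hold for every text, so the rubric "ID" of the relevant text should be checked first.

```python
import re
from typing import NamedTuple


class Identifier(NamedTuple):
    title: str
    position: tuple


# Short title, a comma, then the position specification up to the closing bracket.
ID = re.compile(r"\(ID ([^,\s]+),([^)\s]+)\)")


def parse_id(node: str) -> Identifier:
    """Split an ID node into its short title and its position parts."""
    match = ID.fullmatch(node.strip())
    if match is None:
        raise ValueError(f"not an ID node: {node!r}")
    title, rest = match.groups()
    # Numeric parts become integers; other specifications stay as strings.
    parts = tuple(int(p) if p.isdigit() else p for p in rest.split("."))
    return Identifier(title, parts)


print(parse_id("(ID BoethI,5.11.4)"))
```

For the example above, the result holds the short title BoethI and the position (5, 11, 4), i.e. page, line and token number.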