PCML: Parsed Corpus of Modern Lezgi

About the PCML

The PCML, the Parsed Corpus of Modern Lezgi, is a preliminary attempt at building a growing corpus of syntactically annotated Lezgi texts. The annotation takes the parsed corpora of historical English, such as the PPCMBE, as its starting point and attempts to follow their annotation guidelines as closely as possible.

The Lezgi language belongs to the North-East Caucasian family; see the Ethnologue for details on the language. It is a head-final language, though less strictly so than the Turkic languages.

Annotation efforts

Efforts are under way to annotate a number of texts from different sources. The first text is a transcription of an oral tale.

The following steps are taken in the annotation process:

  1. Sentence segmentation (FLEx)
  2. Tokenization (FLEx)
  3. Interlinearisation and morphological tagging (FLEx)
  4. Conversion from FLEx to FoLiA (automatic, using Cesax)
  5. Conversion from FoLiA to Psdx (automatic, using Cesax)
  6. Dependency parsing (using MaltParser trained on a related language)
  7. Dependency-to-constituency conversion (done within Cesax)
  8. Constituent parse correction (manual, within Cesax)
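
To give an idea of what step 7 involves, here is a minimal Python sketch of a dependency-to-constituency conversion. It is not the algorithm Cesax uses; it is a toy illustration under stated assumptions: the dependency tree is projective, every head with dependents projects exactly one phrase, and the phrase label is simply the head's tag plus "P". The `Token` class, the `to_brackets` function, the tags, and the placeholder English words (arranged in head-final order) are all invented for this example and are not taken from the corpus or from the PPCMBE label set.

```python
from dataclasses import dataclass


@dataclass
class Token:
    idx: int    # 1-based position in the sentence
    form: str   # word form (placeholder words, not corpus data)
    pos: str    # part-of-speech tag (hypothetical tag set)
    head: int   # position of the head token; 0 marks the root


def to_brackets(tokens):
    """Turn a projective dependency tree into a bracketed constituent string."""
    # Group tokens by the position of their head.
    children = {}
    for tok in tokens:
        children.setdefault(tok.head, []).append(tok)

    def project(tok):
        leaf = f"({tok.pos} {tok.form})"
        deps = children.get(tok.idx, [])
        if not deps:
            # A word without dependents stays a bare leaf.
            return leaf
        # Lay out the head and its dependents in surface order.
        parts = sorted(deps + [tok], key=lambda t: t.idx)
        inner = " ".join(leaf if p is tok else project(p) for p in parts)
        # Assumed labelling convention: head tag + "P".
        return f"({tok.pos}P {inner})"

    roots = children.get(0, [])
    if len(roots) != 1:
        raise ValueError("expected exactly one root token")
    return project(roots[0])


# Placeholder sentence in head-final (verb-last) order.
sentence = [
    Token(1, "old", "ADJ", 2),
    Token(2, "man", "N", 4),
    Token(3, "bread", "N", 4),
    Token(4, "ate", "V", 0),
]
print(to_brackets(sentence))
```

A flat projection like this yields trees that still need restructuring and relabelling, which is one reason a manual correction pass such as step 8 follows the automatic conversion.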

Individual texts

A repository of syntactically annotated XML texts is available.







E.Komen@ru.nl | Last update: