Title: Typometrics: from treebanks to typological universals
With the Universal Dependencies (UD) project, treebanks for more than 70 languages have become available in a single annotation scheme, which puts empiricism in syntactic typology on a new footing. Analyzing a treebank not only confirms the presence or absence of certain syntactic constructions in large quantities of natural text; it also makes it possible to measure their frequencies.
First, it becomes possible to compare the synchronic typological family trees of language families, which are based on the presence of certain typological features, with automatically computed hierarchies, for example a dendrogram built from the similarity of syntactic-link frequencies. The automatic hierarchy reflects the actual use of constructions in the different languages.
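As a rough illustration of such an automatic hierarchization, the following Python sketch clusters languages by the similarity of their dependency-relation frequency profiles. Everything in it is an assumption made for illustration: the counts in `TOY_COUNTS` are invented rather than taken from any UD treebank, and cosine distance with naive average-linkage clustering stands in for whatever similarity measure and clustering method the actual study uses.

```python
from collections import Counter
from itertools import combinations
from math import sqrt

# Hypothetical toy counts of dependency relations per language (NOT real UD data).
TOY_COUNTS = {
    "lang_A": Counter({"nsubj": 50, "obj": 30, "case": 20}),
    "lang_B": Counter({"nsubj": 48, "obj": 32, "case": 20}),
    "lang_C": Counter({"nsubj": 20, "obj": 20, "case": 60}),
}


def relative_frequencies(counts):
    """Turn raw relation counts into a frequency profile that sums to 1."""
    total = sum(counts.values())
    return {rel: n / total for rel, n in counts.items()}


def cosine_distance(p, q):
    """1 - cosine similarity between two sparse frequency profiles."""
    rels = set(p) | set(q)
    dot = sum(p.get(r, 0.0) * q.get(r, 0.0) for r in rels)
    norm_p = sqrt(sum(v * v for v in p.values()))
    norm_q = sqrt(sum(v * v for v in q.values()))
    return 1.0 - dot / (norm_p * norm_q)


def average_linkage(profiles):
    """Naive agglomerative clustering; returns the merges in the order made.

    Each merge is (cluster_a, cluster_b, distance), where clusters are
    tuples of language names. Read bottom-up, the merges describe a dendrogram.
    """
    clusters = [(name,) for name in sorted(profiles)]
    merges = []

    def dist(a, b):
        # Average pairwise distance between the members of two clusters.
        pairs = [(x, y) for x in a for y in b]
        return sum(cosine_distance(profiles[x], profiles[y]) for x, y in pairs) / len(pairs)

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: dist(*pair))
        merges.append((a, b, dist(a, b)))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
    return merges


profiles = {lang: relative_frequencies(c) for lang, c in TOY_COUNTS.items()}
for left, right, d in average_linkage(profiles):
    print(left, "+", right, f"at distance {d:.4f}")
```

With real data, the counts would come from the relation column of each treebank, and the resulting merge order could then be compared with the accepted genealogical or typological grouping.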
Second, UD treebanks can be transformed and used to generalize qualitative syntactic universals in the style of Greenberg. Typological universals generally state, or can be interpreted as stating, the impossibility (or statistical rarity) of languages with certain properties. For example, Greenberg's Universal 6 ("All languages with dominant VSO order have SVO as an alternative or as the only alternative basic order") excludes the existence of a language with only the VSO word order. I will show how a new form of visualization of word placement trends, a scatterplot, allows an interpretation in terms of quantitative universals, of which qualitative universals are a special case.
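To make the notion of a word placement trend concrete, here is a minimal Python sketch that measures, for each dependency relation in a CoNLL-U file, the share of dependents that follow their head. The function name `head_initial_share` and the three-token `SAMPLE` sentence are hypothetical; the choice of this particular measure as a scatterplot coordinate is also an assumption, not necessarily the one used in the study.

```python
from collections import defaultdict

# Hypothetical one-sentence CoNLL-U sample (10 tab-separated columns per token).
SAMPLE = (
    "# text = Dogs chase cats\n"
    "1\tDogs\tdog\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tchase\tchase\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\tcats\tcat\tNOUN\t_\t_\t2\tobj\t_\t_\n"
    "\n"
)


def head_initial_share(conllu_text):
    """Return {relation: share of dependents placed after their head}."""
    after_head = defaultdict(int)
    total = defaultdict(int)
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip sentence breaks and comment lines
        cols = line.split("\t")
        # Skip malformed lines, multiword-token ranges (e.g. "1-2"),
        # empty nodes (e.g. "1.1"), and tokens without a numeric head.
        if len(cols) != 10 or not cols[0].isdigit() or not cols[6].isdigit():
            continue
        token_id, head_id, relation = int(cols[0]), int(cols[6]), cols[7]
        if head_id == 0:
            continue  # the root has no head to be ordered against
        total[relation] += 1
        if head_id < token_id:
            after_head[relation] += 1
    return {rel: after_head[rel] / total[rel] for rel in total}


shares = head_initial_share(SAMPLE)
print(shares)
```

Computed over a whole treebank, two such shares (for instance for `nsubj` and `obj`) give one point per language in a scatterplot. Under this reading, a qualitative universal corresponds to a region of the plot that contains no languages, and a quantitative universal to a region that is only sparsely populated.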
The presentation will also cover the preparations needed for a typological interpretation of the UD annotations. In particular, treebank annotation and transformation tools will be presented. The study of word order requires a surface-syntactic annotation that differs slightly from the UD analysis. We will present an alternative annotation format, SUD (Surface-syntactic Universal Dependencies), on which our quantitative typology (or typometrics) studies are based, as well as the UD <-> SUD conversion tools.