pyglottolog Python API
Links
pyglottolog on PyPI: https://pypi.org/project/pyglottolog
pyglottolog on GitHub: https://github.com/glottolog/pyglottolog
Issue tracker: https://github.com/glottolog/pyglottolog/issues
pyglottolog.Glottolog
Most of Glottolog’s data can be accessed through an instance of pyglottolog.Glottolog.
- class pyglottolog.Glottolog(repos='.', *, cache=False)[source]
API to access Glottolog data
This class provides (read and write) access to a local copy of the Glottolog data, which can be obtained as explained in the README.
- Parameters
repos – Path to a copy of https://github.com/glottolog/glottolog
cache (bool) – Indicate whether to cache Languoid objects or not. If True, the API must be used read-only.
- repos: Path
Absolute path to the copy of the data repository.
- tree: Path
Absolute path to the tree directory in the repos.
- references_path(*comps)[source]
Path within the references directory of the repos.
- Parameters
comps (str) –
- iso[source]
- Returns
clldutils.iso_639_3.ISO instance, fed with the data of the latest ISO code table zip found in the build directory.
- property ftsindex: Path
Directory within build where the FullTextSearch index is created.
- Return type
pathlib.Path
Accessing configuration data
Configuration data in https://github.com/glottolog/glottolog/tree/master/config can be accessed conveniently via properties of pyglottolog.Glottolog, each named after the stem of the corresponding file in the config directory.
See configuration data for details about the returned objects.
Accessing languoid data
- property Glottolog.glottocodes: Glottocodes
Registry of Glottocodes.
- Return type
- Glottolog.languoid(id_)[source]
Retrieve a languoid specified by language code.
- Parameters
id_ (typing.Union[str, pyglottolog.languoids.languoid.Languoid]) – Glottocode or ISO code.
- Return type
- Glottolog.languoids(ids=None, maxlevel=None, exclude_pseudo_families=False)[source]
Yields languoid objects.
- Parameters
ids (typing.Optional[set]) – Set of Glottocodes to limit the result to. This is useful to increase performance, since INI file reading can be skipped for languoids not listed.
maxlevel (typing.Union[int, pyglottolog.config.LanguoidLevel, str, None]) – Numeric maximal nesting depth of languoids, or Languoid.level.
exclude_pseudo_families (bool) – Flag signaling whether to exclude pseudo-families, i.e. languoids from non-genealogical trees.
- Return type
typing.Generator[pyglottolog.languoids.languoid.Languoid, None, None]
- Glottolog.languoids_by_code(nodes=None)[source]
Returns a dict mapping the three major language code schemes (Glottocode, ISO code, and Harald’s NOCODE_s) to Languoid objects.
- Return type
typing.Dict[str, pyglottolog.languoids.languoid.Languoid]
The classification can be accessed via the attributes of a pyglottolog.languoids.Languoid. In addition, it can be visualized via
- Glottolog.ascii_tree(start, maxlevel=None)[source]
Prints an ASCII representation of the languoid tree starting at start to stdout.
- Parameters
start (typing.Union[str, pyglottolog.languoids.languoid.Languoid]) –
and serialized as a Newick string via
- Glottolog.newick_tree(start=None, template=None, nodes=None, maxlevel=None)[source]
Returns the Newick representation of a (set of) Glottolog classification tree(s).
- Parameters
start (typing.Union[None, str, pyglottolog.languoids.languoid.Languoid]) – Root languoid of the tree (or None to return the complete classification).
template (typing.Optional[str]) – Python format string accepting the Languoid instance as single variable named l, used to format node labels.
maxlevel (typing.Union[int, pyglottolog.config.LanguoidLevel, None]) –
- Return type
str
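How a template shapes the node labels of a Newick string can be illustrated with a toy tree. The following is a minimal sketch using only the standard library; Node and to_newick are hypothetical names, not part of pyglottolog, and the sketch does not escape characters that are special in Newick.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for a languoid; not a pyglottolog class."""
    id: str
    name: str
    children: List["Node"] = field(default_factory=list)


def to_newick(node: Node, template: str = "{l.id}") -> str:
    """Serialize the subtree below `node`; labels are formatted via `template`."""
    # The node is passed to the format string as a single variable named `l`.
    label = template.format(l=node)
    if not node.children:
        return label
    inner = ",".join(to_newick(child, template) for child in node.children)
    return "({0}){1}".format(inner, label)


# Made-up classification: one family with a language and a subgroup.
family = Node("fam1", "Family", [
    Node("lng1", "Alpha"),
    Node("grp1", "Group", [Node("lng2", "Beta"), Node("lng3", "Gamma")]),
])
newick = to_newick(family, template="{l.name}") + ";"
print(newick)
```

Labels containing parentheses or commas would break such a naive serialization, which is presumably why Languoid supports a dedicated newick_name format specifier.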
Accessing reference data
Performance considerations
Reading the data for Glottolog’s more than 25,000 languoids from the same number of files in individual directories isn’t particularly quick. So on average computers running
>>> list(glottolog.languoids())
would take around 15 seconds.
Due to this, care should be taken not to read languoid data from disk repeatedly. In particular “N+1”-type problems should be avoided, where one would read all languoids into memory and then look up attributes on each languoid, thereby triggering new reads from disk. This may easily happen, since attributes such as Languoid.family are implemented as properties, which traverse the directory tree and read information from disk at access time.
To make it possible to avoid such problems, many of these properties can be substituted with a call to a similar method of Languoid, which accepts a “node map” (i.e. a dict mapping Languoid.id to Languoid objects) as parameter, e.g. Languoid.ancestors_from_nodemap or Languoid.descendants_from_nodemap. Typical usage would look as follows:
>>> languoids = {l.id: l for l in glottolog.languoids()}
>>> for l in languoids.values():
...     if not l.ancestors_from_nodemap(languoids):
...         print('top-level {0}: {1}'.format(l.level, l.name))
Alternatively, if you only want to read Glottolog data, you may enable caching when instantiating pyglottolog.Glottolog.
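Why a node map avoids repeated reads can be seen in a toy model. The sketch below assumes that each record stores the ids of its ancestors, so that resolving them needs only dict lookups; the class Lang and its sample data are hypothetical and do not reflect how pyglottolog stores lineages internally.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Lang:
    """Hypothetical stand-in for a languoid with a stored lineage."""
    id: str
    name: str
    lineage: List[str]  # ancestor ids, from root to parent

    def ancestors_from_nodemap(self, nodes: Dict[str, "Lang"]) -> List["Lang"]:
        # Pure dict lookups: no further I/O once all nodes are in memory.
        return [nodes[ancestor_id] for ancestor_id in self.lineage]


# Made-up records, read "once" into memory.
records = [
    Lang("fam1", "Family", []),
    Lang("grp1", "Group", ["fam1"]),
    Lang("lng1", "Alpha", ["fam1", "grp1"]),
    Lang("iso1", "Isolate", []),
]
nodes = {lang.id: lang for lang in records}

# Top-level nodes are exactly those without ancestors.
top_level = [lang.name for lang in nodes.values()
             if not lang.ancestors_from_nodemap(nodes)]
print(top_level)
```

The cost of building the dict is paid once; every later ancestor lookup is then a constant-time operation per ancestor.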
Configuration data
The config subdirectory of Glottolog data contains machine-readable metadata like the list of macroareas. This information can be accessed via an instance of pyglottolog.Glottolog, too, using the stem of the filename as attribute name:
>>> for ma in Glottolog().macroareas.values():
...     print(ma.name)
...
South America
Eurasia
Africa
Papunesia
North America
Australia
Below are the details of the API to access configuration data.
- class pyglottolog.config.AES(ordinal, id, name, egids, unesco, elcat, reference_id, icon=None)[source]
AES status values
See also
Method generated by attrs for class AES.
- ordinal
Sequential numeric value
- id
unique identifier (suitable as Python name, see https://docs.python.org/3/reference/lexical_analysis.html#identifiers)
- name
unique human-readable name
- egids
corresponding status in the EGIDS scale
- unesco
corresponding status in the UNESCO scale
- elcat
corresponding status in ElCat
- reference_id
Glottolog reference ID linking to further information
- class pyglottolog.config.AESSource(id, name, url, reference_id, pages=None)[source]
Reference information for AES sources
Method generated by attrs for class AESSource.
- id
- name
- url
- reference_id
Glottolog reference ID linking to further information
- pages
- class pyglottolog.config.Macroarea(id, name, description, reference_id)[source]
Glottolog macroareas (see https://glottolog.org/meta/glossary#macroarea)
Method generated by attrs for class Macroarea.
- id
- name
- description
- reference_id
Glottolog reference ID linking to further information
- class pyglottolog.config.DocumentType(rank, id, name, description, abbv, bibabbv, webabbr, triggers)[source]
Document types categorize Glottolog references
Method generated by attrs for class DocumentType.
- rank
- id
- name
- description
- class pyglottolog.config.LanguageType(id, pseudo_family_id, category, description)[source]
Language types categorize languages.
Method generated by attrs for class LanguageType.
- id
- pseudo_family_id
Glottocode of the pseudo-family that languages of this type are grouped in.
- category
category name for languages of this type
- description
- class pyglottolog.config.LanguoidLevel(ordinal, id, description)[source]
Languoid levels describe the position of languoid nodes in the classification.
- Variables
name – alias for id
Method generated by attrs for class LanguoidLevel.
- ordinal
- id
- description
- class pyglottolog.config.Config[source]
More convenient access to objects stored as sections in INI files
This class makes objects (i.e. INI sections) accessible as values of a dict, keyed by an id attribute, which is inferred from the id or name option of the section, and, additionally, as attributes named after the id.
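The access pattern described for Config can be sketched with the standard library. This is an illustration under assumptions, not the pyglottolog implementation: MiniConfig is a hypothetical class, the INI text is made up, and sections are turned into plain namespaces rather than attrs classes.

```python
import configparser
from types import SimpleNamespace


class MiniConfig(dict):
    """Hypothetical dict of INI sections with additional attribute access."""

    @classmethod
    def from_string(cls, text):
        parser = configparser.ConfigParser()
        parser.read_string(text)
        res = cls()
        for section in parser.sections():
            options = dict(parser[section])
            # Infer the key from the `id` option, falling back to `name`.
            key = options.get("id") or options["name"]
            res[key] = SimpleNamespace(**options)
        return res

    def __getattr__(self, name):
        # Only called when regular attribute lookup fails.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name)


# Made-up sample data, not copied from the Glottolog repository.
SAMPLE = """
[First Area]
id = first
name = First Area

[second]
name = second
"""
cfg = MiniConfig.from_string(SAMPLE)
print(cfg.first.name, cfg["second"].name)
```

Attribute access only works for ids that are valid Python identifiers, which matches the requirement stated for AES.id above.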
Languoid data
All metadata related to a languoid (i.e. the content of the languoid’s INI file and the classification - its relation to other languoids) is available from a pyglottolog.languoids.Languoid instance.
- class pyglottolog.languoids.Languoid(cfg, lineage=None, id_=None, directory=None, tree=None, _api=None)[source]
Info on languoids is encoded in the INI files and in the directory hierarchy of pyglottolog.Glottolog.tree. This class provides access to all of it.
Languoid formatting:
- Variables
_format_specs – A dict mapping custom format specifiers to conversion functions. Usage:
>>> l = Languoid.from_name_id_level(pathlib.Path('.'), 'N(a,m)e', 'abcd1234', 'language')
>>> '{0:newick_name}'.format(l)
'N{a/m}e'
See also
https://www.python.org/dev/peps/pep-3101/#format-specifiers and https://www.python.org/dev/peps/pep-3101/#controlling-formatting-on-a-per-type-basis
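The mechanism behind such custom format specifiers is Python's __format__ hook. Below is a minimal sketch with a hypothetical class Named; the replacement rule for newick_name is inferred from the single example above and is an assumption, not the documented behavior of Languoid.

```python
class Named:
    """Hypothetical class illustrating custom format specifiers."""

    # Maps custom format specifiers to conversion functions.
    # The character replacements are guessed from the 'N(a,m)e' example.
    _format_specs = {
        "newick_name": lambda obj: (
            obj.name.replace("(", "{").replace(")", "}").replace(",", "/")),
    }

    def __init__(self, name):
        self.name = name

    def __format__(self, format_spec):
        if format_spec in self._format_specs:
            return self._format_specs[format_spec](self)
        # Fall back to standard string formatting for all other specifiers.
        return format(self.name, format_spec)


formatted = "{0:newick_name}".format(Named("N(a,m)e"))
padded = "{0:>5}".format(Named("ab"))
print(formatted, repr(padded))
```

Keeping the conversions in a dict means new specifiers can be added without touching __format__ itself.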
Refer to the factory methods for typical use cases of instantiating a Languoid:
Languoid.from_dir()
Languoid.from_name_id_level()
- Parameters
cfg (clldutils.inifile.INI) – INI instance storing the languoid’s metadata.
lineage (typing.Optional[typing.List[typing.Tuple[str, str, str]]]) – List of ancestors (from root to this languoid).
id_ (typing.Optional[str]) – Glottocode for the languoid (or None, if directory is passed).
directory (typing.Optional[pathlib.Path]) –
tree (typing.Optional[pathlib.Path]) –
_api – Some properties require access to config data which is accessed through a Glottolog API instance.
- classmethod from_dir(directory, nodes=None, _api=None, **kw)[source]
Create a Languoid from a directory, named with the Glottocode and containing md.ini.
This method is used by pyglottolog.Glottolog to read Languoid objects from the repository’s languoids/tree directory.
- Parameters
directory (pathlib.Path) –
- classmethod from_name_id_level(tree, name, id, level, **kw)[source]
This method is used in pyglottolog.lff to instantiate Languoid objects for new nodes encountered in “lff”-format trees.
- newick_node(nodes=None, template=None, maxlevel=None, level=0)[source]
Return a newick.Node representing the subtree of the Glottolog classification starting at the languoid.
- Parameters
template – Python format string accepting the Languoid instance as single variable named l, used to format node labels.
- Return type
newick.Node
- write_info(outdir=None)[source]
Write Languoid metadata as INI file to outdir/<INFO_FILENAME>.
- Parameters
outdir (typing.Optional[pathlib.Path]) –
- property glottocode
Alias for id
- property category
Languoid category.
Category name from pyglottolog.config.LanguageType for languoids of level “language”, “Family” or “Pseudo Family” for families, “Dialect” for dialects.
- property isolate: bool
Flag signaling whether the languoid is an isolate, i.e. has level “language” and is not member of a family.
- Return type
bool
- property children: List[Languoid]
List of direct descendants of the languoid in the classification tree.
Note
Using this on many languoids can be slow, because the directory tree may be traversed and INI files read multiple times. To circumvent this problem, you may use a read-only pyglottolog.Glottolog instance, by passing cache=True at initialization.
- Return type
typing.List[pyglottolog.languoids.languoid.Languoid]
- property ancestors: List[Languoid]
List of ancestors of the languoid in the classification tree, from root (i.e. top-level family) to parent node.
Note
Using this on many languoids can be slow, because the directory tree may be traversed and INI files read multiple times. To circumvent this problem, you may use a read-only pyglottolog.Glottolog instance, by passing cache=True at initialization.
- Return type
typing.List[pyglottolog.languoids.languoid.Languoid]
- property parent: Optional[Languoid]
Parent languoid or None.
Note
Using this on many languoids can be slow, because the directory tree may be traversed and INI files read multiple times. To circumvent this problem, you may use a read-only pyglottolog.Glottolog instance, by passing cache=True at initialization.
- Return type
typing.Optional[pyglottolog.languoids.languoid.Languoid]
- property family: Optional[Languoid]
Top-level family the languoid belongs to or None.
Note
Using this on many languoids can be slow, because the directory tree may be traversed and INI files read multiple times. To circumvent this problem, you may use a read-only pyglottolog.Glottolog instance, by passing cache=True at initialization.
- Return type
typing.Optional[pyglottolog.languoids.languoid.Languoid]
- property names: Dict[str, list]
A dict mapping alternative name providers to lists of alternative names for the languoid by the given provider.
- Return type
typing.Dict[str, list]
- property sources: List[Reference]
List of Glottolog references linked to the languoid
- Return type
pyglottolog.references.Reference
- property endangerment: Union[None, Endangerment]
Endangerment information about the languoid.
- Return type
- property classification_comment: Union[None, ClassificationComment]
Classification information about the languoid.
- Return type
- property ethnologue_comment: Union[None, EthnologueComment]
Commentary about the classification of the languoid in Ethnologue.
- Return type
- property links: List[Link]
Links to web resources related to the languoid
- Return type
typing.List[pyglottolog.languoids.models.Link]
- property countries: List[Country]
Countries a language is spoken in.
- Return type
typing.List[pyglottolog.languoids.models.Country]
- property name
The Glottolog name of the languoid
- property latitude: Union[None, float]
The geographic latitude of the point chosen as representative coordinate of the languoid
- Return type
typing.Optional[float]
- property longitude: Union[None, float]
The geographic longitude of the point chosen as representative coordinate of the languoid
- Return type
typing.Optional[float]
- class pyglottolog.languoids.Glottocodes(fname)[source]
Registry keeping track of glottocodes that have been dealt out.
Some of the data available for languoids has enough internal structure to merit separate classes, simplifying access.
- class pyglottolog.languoids.Reference(key, pages=None, trigger=None)[source]
A reference to a bibliographical record in Glottolog.
Method generated by attrs for class Reference.
- key
- pages
- class pyglottolog.languoids.Endangerment(status, source, comment, date)[source]
Info about the endangerment status of the languoid
- Parameters
status (pyglottolog.config.AES) –
source (pyglottolog.config.AESSource) –
Method generated by attrs for class Endangerment.
- comment
- date
Date when the endangerment status was assessed
- class pyglottolog.languoids.EthnologueComment(isohid, comment_type, ethnologue_versions='', comment=None)[source]
Commentary about the classification of the languoid according to Ethnologue
Method generated by attrs for class EthnologueComment.
- comment_type
Either
“spurious”, meaning the comment explains why the languoid in question is spurious and in which Ethnologue version(s) (as below) that is/was the case, or
“missing”, meaning the comment explains why the languoid in question is missing (as a language entry) and in which Ethnologue version(s) (as below) that is/was the case.
- ethnologue_versions
The Ethnologue version(s) from E16-E19 that the comment pertains to, joined by slashes (/), e.g. E16/E17. In the case of comment_type=spurious, E16/E17 in the version field means that the code was spurious in E16/E17 but no longer spurious in E18/E19. In the case of comment_type=missing, E16/E17 would mean that the code was missing from E16/E17, but present in E18/E19. If the comment concerns a language where versions would be the empty string, the string ISO 639-3 appears instead.
- comment
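The encoding of ethnologue_versions described above can be unpacked with a few lines of code. This is a sketch under the assumption that the field contains either slash-separated version labels or the literal string ISO 639-3; parse_versions is a hypothetical helper, not a pyglottolog function.

```python
from typing import List


def parse_versions(ethnologue_versions: str) -> List[str]:
    """Split a versions field into a list of Ethnologue version labels."""
    # The literal "ISO 639-3" stands in for "no Ethnologue version applies".
    if ethnologue_versions == "ISO 639-3":
        return []
    return ethnologue_versions.split("/")


versions = parse_versions("E16/E17")
print(versions)
```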
- class pyglottolog.languoids.ISORetirement(code=None, name=None, change_request=None, effective=None, reason=None, change_to=NOTHING, remedy=None, comment=None)[source]
Information extracted from accepted ISO 639-3 change requests about retired ISO codes associated with the languoid.
Method generated by attrs for class ISORetirement.
- code
Retired ISO 639-3 code
- name
Name of the retired ISO language
- change_request
Number of the ISO change request
- effective
Date of acceptance of the change request
- reason
Reason to retire the ISO code
- change_to
List of ISO codes replacing the retired code
- remedy
What to do about the retired code
- comment
- class pyglottolog.languoids.ClassificationComment(sub=None, subrefs=NOTHING, family=None, familyrefs=NOTHING)[source]
Commentary on the classification of the languoid
Method generated by attrs for class ClassificationComment.
- sub
Commentary on the internal classification of the descendants of the languoid
- family
Commentary on the classification of the languoid within its family
Reference data
Glottolog’s reference data consists of bibliographical information in a set of BibTeX files, described with metadata in BIBFILES.ini.
This information can be accessed via an instance of
pyglottolog.Glottolog
, too:
>>> g = Glottolog()
>>> print(g.bibfiles['hh'].description)
The bibliography of HH, typed in between 2005-2020.
It has been annotated by hand (type and language).
It contains descriptive material from all over the world, mostly lesser-known languages.
>>> print(g.bibfiles['hh:s:Karang:Tati-Harzani'])
@book{s:Karang:Tati-Harzani,
author = {'Abd-al-'Ali Kārang},
title = {Tāti va Harzani, do lahja az zabān-i bāstān-e Āẕarbāyjān},
publisher = {Tabriz: Tabriz University Press},
address = {Tabriz},
pages = {6+160},
year = {1334 [1953]},
glottolog_ref_id = {41999},
hhtype = {grammar_sketch},
inlg = {Farsi [pes]},
lgcode = {Tati, Harzani [hrz]},
macro_area = {Eurasia}
}
The objects representing reference data are described below.
- class pyglottolog.references.BibFiles(bibfiles)[source]
Ordered collection of BibFile objects accessible by filename or index.
- classmethod from_path(path, api=None)[source]
BibTeX files from <path>/bibtex/*.bib if listed in <path>/BIBFILES.ini.
- Parameters
path (typing.Union[str, pathlib.Path]) –
- Return type
- __getitem__(index_or_filename)[source]
Retrieve a bibfile by index or filename or an entry by qualified key.
- Parameters
index_or_filename (typing.Union[int, str]) – Either an int index, or a bibfile name, or a provider-qualified BibTeX key in the form <prov>:<key>.
- Return type
typing.Union[pyglottolog.references.bibfiles.BibFile, pyglottolog.references.bibfiles.Entry]
- Returns
A BibFile instance, or an Entry instance.
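The three lookup modes of __getitem__ can be sketched as follows. MiniBibFiles is a hypothetical, simplified class operating on nested dicts; it illustrates the dispatch logic only and is not the pyglottolog implementation. The qualified key is taken from the example above, the record contents are toy data.

```python
class MiniBibFiles:
    """Hypothetical, simplified collection; not pyglottolog's BibFiles."""

    def __init__(self, files):
        # `files` maps a provider name to a dict of BibTeX key -> record.
        self._files = dict(files)
        self._names = list(self._files)

    def __getitem__(self, index_or_name):
        if isinstance(index_or_name, int):
            return self._files[self._names[index_or_name]]
        if ":" in index_or_name:
            # Split on the first colon only: keys may contain colons, too.
            provider, key = index_or_name.split(":", 1)
            return self._files[provider][key]
        return self._files[index_or_name]


bibfiles = MiniBibFiles({
    "hh": {"s:Karang:Tati-Harzani": {"year": "1334 [1953]"}},
    "other": {"smith2001": {"year": "2001"}},
})
entry = bibfiles["hh:s:Karang:Tati-Harzani"]
print(entry)
```

Splitting on the first colon matters because keys such as s:Karang:Tati-Harzani contain colons themselves.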
- class pyglottolog.references.BibFile(fname, name=None, title=None, description=None, abbr=None, encoding='utf-8', normalize='NFC', sortkey=None, priority=0, url=None, curation=None, api=None)[source]
Represents a BibTeX file, storing a provider’s bibliography, providing easy access to its records.
- Parameters
fname (pathlib.Path) –
Method generated by attrs for class BibFile.
- name
Short name of the bibliography
- title
Title of the bibliography
- description
The provenance of the bibliography
- url
URL pointing to the source of the bibliography
- curation
Curation policy for the bibliography at Glottolog
- class pyglottolog.references.Entry(key, type, fields, bib, api=None)[source]
Represents an entry in a BibFile, i.e. a bibliographical record.
Note
Entry instances are orderable. The ordering is the one used to compute MEDs, i.e.
grammars are “better” than other document types,
more pages is “better” than less,
more recent is “better” than old.
>>> g = pyglottolog.Glottolog()
>>> g.bibfiles['hh:g:MacDonell:Sanskrit'] > g.bibfiles['hh:hv:Weijnen:Nederlandse']
True
>>> refs = g.refs_by_languoid(g.bibfiles['hh'])
>>> sorted(refs[0]['stan1295'])[-1].med_type.name
'long grammar'
- Parameters
fields (dict) –
Method generated by attrs for class Entry.
- key
- type
BibTeX entry type
- fields: dict
The metadata of the record
- property id: str
The qualified entry ID, including the provider prefix.
- Return type
str
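An ordering like the one described for Entry can be modeled with a sort key. The sketch below assumes that the three criteria apply in the listed order of precedence (document type, then pages, then year); the class Record and the rank values are hypothetical and not taken from pyglottolog.

```python
from dataclasses import dataclass
from functools import total_ordering

# Hypothetical ranks; a higher value means a "better" document type.
DOCTYPE_RANK = {"grammar": 2, "grammar_sketch": 1, "wordlist": 0}


@total_ordering
@dataclass(eq=False)
class Record:
    """Hypothetical bibliographical record; not pyglottolog's Entry."""
    key: str
    doctype: str
    pages: int
    year: int

    def _sort_key(self):
        # Tuples compare lexicographically: type first, then pages, then year.
        return (DOCTYPE_RANK[self.doctype], self.pages, self.year)

    def __eq__(self, other):
        return self._sort_key() == other._sort_key()

    def __lt__(self, other):
        return self._sort_key() < other._sort_key()


records = [
    Record("d", "grammar", 450, 1990),
    Record("a", "wordlist", 300, 2010),
    Record("c", "grammar", 450, 1900),
    Record("b", "grammar", 120, 1950),
]
best = sorted(records)[-1]
print(best.key)
```

With such an ordering in place, the best description for a languoid is simply the last element of the sorted list of its references.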
Computing homelands
Computing geo-coordinates for homelands of language groups, i.e. languoids of level family.
Various ways of computing “homelands” for language groups have been proposed in the literature since Sapir 1916. This module provides implementations of some of the simpler algorithms.
- pyglottolog.homelands.compute(api, method)[source]
Compute homelands for applicable Glottolog subgroups using a method implemented in this module or any callable with appropriate signature.
- Parameters
api (pyglottolog.api.Glottolog) –
method (typing.Callable[[typing.List[pyglottolog.languoids.languoid.Languoid]], typing.Dict[str, typing.Tuple[decimal.Decimal, decimal.Decimal]]]) –
- Return type
typing.Dict[str, typing.Tuple[decimal.Decimal, decimal.Decimal]]
- pyglottolog.homelands.md(langs)[source]
Compute homeland coordinates for a language group (and its subgroups) using the “md” method described in “Testing methods of linguistic homeland detection using synthetic data” by Søren Wichmann and Taraka Rama, https://doi.org/10.1098/rstb.2020.0202
Wichmann and Rama 2021:
In the third approach, abbreviated ‘md’ for ‘minimal distance’, we compute the average distance (as the crow flies) from each language to all the other languages. The location of the language that has the smallest average distance to the others is equated with the homeland.
We use the pyproj.Geod.inv method to compute the great-circle distance between two points.
- Parameters
langs (typing.List[pyglottolog.languoids.languoid.Languoid]) –
- Return type
typing.Dict[str, typing.Tuple[decimal.Decimal, decimal.Decimal]]
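The “minimal distance” idea quoted above can be sketched for a single group of languages. Note the substitutions: this sketch uses the haversine formula on a sphere instead of pyproj.Geod.inv, works on plain floats instead of Decimal, and returns a language id rather than a mapping of group ids to coordinates. The function names and sample coordinates are hypothetical.

```python
import math
from typing import Dict, Tuple

Coord = Tuple[float, float]  # (latitude, longitude) in degrees


def haversine_km(a: Coord, b: Coord) -> float:
    """Great-circle distance on a sphere with radius 6371 km."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def minimal_distance_homeland(coords: Dict[str, Coord]) -> str:
    """Return the id of the language closest, on average, to all others."""
    if len(coords) == 1:
        return next(iter(coords))

    def mean_distance(lang_id: str) -> float:
        distances = [haversine_km(coords[lang_id], other)
                     for other_id, other in coords.items()
                     if other_id != lang_id]
        return sum(distances) / len(distances)

    return min(coords, key=mean_distance)


# Made-up coordinates along the equator.
coords = {"a": (0.0, 0.0), "b": (0.0, 1.0), "c": (0.0, 5.0)}
homeland = minimal_distance_homeland(coords)
print(homeland, coords[homeland])
```

The computation is quadratic in the number of languages, which is unproblematic for typical family sizes.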
- pyglottolog.homelands.recursive_centroids(langs)[source]
Recursively compute homelands of subgroups from the homelands of their immediate children in the classification.
The homeland of a single language is its geographic coordinate.
The homeland of a set of coordinates (for homelands or languages) is computed as the point on land nearest to the centroid of the convex hull of the set of coordinates.
- Parameters
langs (typing.List[pyglottolog.languoids.languoid.Languoid]) –
- Return type
typing.Dict[str, typing.Tuple[decimal.Decimal, decimal.Decimal]]
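The recursive structure of this method can be shown in a deliberately simplified form. In the sketch below the arithmetic mean of the child homelands replaces the centroid of the convex hull, and the step of moving the result to the nearest point on land is omitted, so the numbers will differ from what recursive_centroids computes. The tuple-based tree and the function name are hypothetical.

```python
from typing import Dict, Tuple

Coord = Tuple[float, float]  # (latitude, longitude) in degrees


def homelands(node, result: Dict[str, Coord]) -> Coord:
    """Fill `result` with group homelands; return the homeland of `node`.

    A node is a tuple (id, coordinate_or_None, children).
    """
    node_id, coord, children = node
    if not children:
        # The homeland of a single language is its geographic coordinate.
        return coord
    points = [homelands(child, result) for child in children]
    # Simplification: plain mean instead of the convex hull centroid.
    lat = sum(p[0] for p in points) / len(points)
    lon = sum(p[1] for p in points) / len(points)
    result[node_id] = (lat, lon)
    return result[node_id]


# Made-up classification with one subgroup.
tree = ("fam", None, [
    ("l1", (0.0, 0.0), []),
    ("grp", None, [("l2", (2.0, 2.0), []), ("l3", (4.0, 6.0), [])]),
])
result: Dict[str, Coord] = {}
homelands(tree, result)
print(result)
```

Because each subgroup contributes a single point to its parent, large subgroups do not outweigh small ones. Averaging longitudes in this naive way gives wrong results for groups spanning the antimeridian.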