pyglottolog.Glottolog
Most of Glottolog’s data can be accessed through an instance of pyglottolog.Glottolog.
- class pyglottolog.Glottolog(repos='.', *, cache=False)
API to access Glottolog data
This class provides read and write access to a local copy of the Glottolog data, which can be obtained as explained in the README.
- Parameters
repos – Path to a copy of https://github.com/glottolog/glottolog
cache (bool) – Indicate whether to cache Languoid objects or not. If True, the API must be used read-only.
- repos: Path
Absolute path to the copy of the data repository.
- tree: Path
Absolute path to the tree directory in the repos.
- references_path(*comps)
Path within the references directory of the repos.
- Parameters
comps (str) –
- iso
- Returns
clldutils.iso_639_3.ISO instance, fed with the data of the latest ISO code table zip found in the build directory.
- property ftsindex: Path
Directory within build where the FullTextSearch index is created.
- Return type
pathlib.Path
Accessing configuration data
Configuration data in https://github.com/glottolog/glottolog/tree/master/config can be accessed conveniently via properties of pyglottolog.Glottolog.
See configuration data for details about the returned objects.
Accessing languoid data
- property Glottolog.glottocodes: Glottocodes
Registry of Glottocodes.
- Return type
Glottocodes
- Glottolog.languoid(id_)
Retrieve a languoid specified by language code.
- Parameters
id_ (typing.Union[str, pyglottolog.languoids.languoid.Languoid]) – Glottocode or ISO code.
- Return type
- Glottolog.languoids(ids=None, maxlevel=None, exclude_pseudo_families=False)
Yields languoid objects.
- Parameters
ids (typing.Optional[set]) – Set of Glottocodes to limit the result to. This is useful to increase performance, since INI file reading can be skipped for languoids not listed.
maxlevel (typing.Union[int, pyglottolog.config.LanguoidLevel, str, None]) – Numeric maximal nesting depth of languoids, or Languoid.level.
exclude_pseudo_families (bool) – Flag signaling whether to exclude pseudo-families, i.e. languoids from non-genealogical trees.
- Return type
typing.Generator[pyglottolog.languoids.languoid.Languoid, None, None]
- Glottolog.languoids_by_code(nodes=None)
Returns a dict mapping the three major language code schemes (Glottocode, ISO code, and Harald’s NOCODE_s) to Languoid objects.
- Return type
typing.Dict[str, pyglottolog.languoids.languoid.Languoid]
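As an illustration of how the code-keyed mapping might be used, the following sketch resolves a mixed list of codes in one pass. The helper name `resolve_codes` is hypothetical, and `glottolog` is assumed to be an already instantiated pyglottolog.Glottolog object; only languoids_by_code() as documented above is called.

```python
def resolve_codes(glottolog, codes):
    """Map each code in `codes` to its Languoid, or None if unknown.

    `glottolog` is assumed to be a pyglottolog.Glottolog instance; any
    object with a compatible languoids_by_code() method works as well.
    """
    # Build the code -> Languoid dict once. This reads every languoid,
    # so it should not be repeated for each lookup.
    by_code = glottolog.languoids_by_code()
    return {code: by_code.get(code) for code in codes}
```

The result has one key per requested code, whether that code is a Glottocode or an ISO code; codes that are not in the mapping are paired with None instead of raising a KeyError.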
The classification can be accessed via a pyglottolog.languoids.Languoid’s attributes. In addition, it can be visualized via
- Glottolog.ascii_tree(start, maxlevel=None)
Prints an ASCII representation of the languoid tree starting at start to stdout.
- Parameters
start (typing.Union[str, pyglottolog.languoids.languoid.Languoid]) –
and serialized as Newick string via
- Glottolog.newick_tree(start=None, template=None, nodes=None, maxlevel=None)
Returns the Newick representation of a (set of) Glottolog classification tree(s).
- Parameters
start (typing.Union[None, str, pyglottolog.languoids.languoid.Languoid]) – Root languoid of the tree (or None to return the complete classification).
template (typing.Optional[str]) – Python format string accepting the Languoid instance as single variable named l, used to format node labels.
maxlevel (typing.Union[int, pyglottolog.config.LanguoidLevel, None]) –
- Return type
str
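The template parameter is easiest to understand by trying a format string on its own. The sketch below previews a label using a stand-in object instead of a real Languoid; the helper `preview_label`, the template and the attribute values are all made up for illustration, and it assumes only what the parameter description states, namely that the languoid is passed to the format string as the variable `l`.

```python
from types import SimpleNamespace

# Hypothetical label template built from the Languoid attributes
# `id` and `level`, which are mentioned elsewhere in this document.
TEMPLATE = "{l.id}_{l.level}"


def preview_label(template, languoid):
    """Format `template` with the languoid as the single variable `l`."""
    return template.format(l=languoid)


# Stand-in for a Languoid, with made-up values.
example = SimpleNamespace(id="exam1234", level="language")
label = preview_label(TEMPLATE, example)
```

A template checked this way could then be passed on as `glottolog.newick_tree(start, template=TEMPLATE)`. Parentheses, commas, colons and semicolons are structural characters in the Newick format, so a template that can emit them deserves extra care.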
Accessing reference data
Performance considerations
Reading the data for Glottolog’s more than 25,000 languoids from the same number of files in individual directories isn’t particularly quick. On an average computer, running
>>> list(glottolog.languoids())
would take around 15 seconds.
Due to this, care should be taken not to read languoid data from disk repeatedly. In particular “N+1”-type problems should be avoided, where one would read all languoids into memory and then look up attributes on each languoid, thereby triggering new reads from disk. This may easily happen, since attributes such as Languoid.family are implemented as properties, which traverse the directory tree and read information from disk at access time.
To make it possible to avoid such problems, many of these properties can be substituted with a call to a similar method of Languoid, which accepts a “node map” (i.e. a dict mapping Languoid.id to Languoid objects) as parameter, e.g. Languoid.ancestors_from_nodemap or Languoid.descendants_from_nodemap. Typical usage would look as follows:
>>> languoids = {l.id: l for l in glottolog.languoids()}
>>> for l in languoids.values():
...     if not l.ancestors_from_nodemap(languoids):
...         print('top-level {0}: {1}'.format(l.level, l.name))
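The same node-map pattern can be wrapped in a small reusable function. The following is a minimal sketch, not part of pyglottolog: the name `count_top_level` is hypothetical, and it relies only on the `id` and `level` attributes and the `ancestors_from_nodemap` method described above.

```python
from collections import Counter


def count_top_level(languoids):
    """Count top-level languoids per level, using a node map.

    `languoids` is any iterable of objects offering `id`, `level` and
    `ancestors_from_nodemap(nodemap)`, e.g. glottolog.languoids().
    """
    # Read everything once and index it by Languoid.id.
    nodemap = {lang.id: lang for lang in languoids}
    counts = Counter()
    for lang in nodemap.values():
        # An empty ancestor list marks a top-level languoid. The lookup
        # goes through the node map instead of back to disk.
        if not lang.ancestors_from_nodemap(nodemap):
            # str() mirrors how the example above formats the level.
            counts[str(lang.level)] += 1
    return counts
```

Because the iterable is consumed exactly once, calling `count_top_level(glottolog.languoids())` should cost roughly one full read of the languoid tree, however many languoids are inspected afterwards.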
Alternatively, if you only want to read Glottolog data, you may enable caching when instantiating pyglottolog.Glottolog.
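A read-only setup with caching might look like the sketch below. The function name `open_read_only` and the directory check are illustrative additions; the only pyglottolog call used is the constructor with the `repos` and `cache` parameters documented at the top of this page.

```python
from pathlib import Path


def open_read_only(repos):
    """Return a caching Glottolog API object for the clone at `repos`."""
    repos = Path(repos)
    # Fail early with a clear message if the path is not a directory.
    if not repos.is_dir():
        raise FileNotFoundError(f"no Glottolog repository at {repos}")

    # Imported here so the check above works without pyglottolog installed.
    from pyglottolog import Glottolog

    # cache=True caches Languoid objects; per the parameter description,
    # the API must then be used read-only.
    return Glottolog(repos, cache=True)
```

With an object created this way, repeated lookups of the same languoid should be served from the cache, at the price of giving up write access for the lifetime of that object.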