DAFSA Library¶
DAFSA is a library for computing Deterministic Acyclic Finite State
Automata
(also known as “directed acyclic word graphs”, or DAWG). DAFSA are data
structures derived from tries
that allow to represent a set of sequences (typically character strings
or n-grams) in the form of a directed acyclic graph with a single
source vertex (the start
symbol of all sequences) and at least one
sink edge (final
symbols, each pointed to by one or more sequences).
In the current implementation, a trait of each node expresses whether it
can be used a sink.
The primary difference between DAFSA and tries is that the latter
eliminates suffix and infix redundancy, as in the example of Figure 1
(from the linked Wikipedia article) storing the set of strings
"tap"
, "taps"
, "top"
, and "tops"
. Even though DAFSAs
cannot be used to store precise frequency information, given that
multiple paths can reach the same terminal node, they still allow to
estimate the sampling frequency; being acyclic, they can also reject any
sequence not included in the training. Fuzzy extensions will allow to
estimate the sampling probability of unobserved sequences.
This data structure is a special case of a finite state recognizer that acts as a deterministic finite state automaton, as it recognizes all and tonly the sequences it was built upon. It is frequently used in computer science for the space-efficient storage of sets of sequences without common compression techniques, such as dictionary or entropy types, or without probabilistic data structures, such as Bloom filters.
The automata generated by this library are intended for usage by linguists in studies on morphology and formal grammars, allowing for faster, easier, and simpler generation of visualizations. The library also extends published models by allowing to approximate probability of random observation by carrying information on the weight of each graph edge.
Installation¶
In any standard Python environment, dafsa
can be installed with:
pip install dafsa
A conda
package is also available, and can be installed from the
tresoldi
channel with:
conda install -c tresoldi dafsa
Alternatives¶
The main alternative to this library is the dawg
one, available at
https://github.com/pytries/DAWG. dawg
wraps the dwagdic
C++
library, and is intended to production usage of DAFSAs as a
space-efficient data structure. It does not support the computation of
edge weights, nor it is intended for exporting the internal structure as
a graph.
Other alternatives are the adfa
and minim
packages, written in
C/C++, written by Jan Daciuk. The personal webpage hosting them has been
offline for years, with a version at the Wayback
Machine
available. Note that the archived version does not include the
packages, which are available in our GitHub repository in the
/daciuk
folder (Jan Daciuk is in no way involved with this library,
and his code, originally released under the GPL2 license, is copied for
archival purposes).
Author¶
The library is developed by Tiago Tresoldi (tresoldi@shh.mpg.de).
The author has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No. ERC Grant #715618, Computer-Assisted Language Comparison.
How to cite¶
If you use dafsa
, please cite it as:
Tresoldi, Tiago (2020). DAFSA, a a library for computing Deterministic Acyclic Finite State Automata. Version 1.0. Jena. DOI: 10.5281/zenodo.3668870
In BibTeX:
@misc{Tresoldi2020dafsa,
author = {Tresoldi, Tiago},
title = {DAFSA, a a library for computing Deterministic Acyclic Finite State Automata. Version 1.0},
howpublished = {\url{https://github.com/tresoldi/dafsa}},
address = {Jena},
doi = {10.5281/zenodo.3668870},
year = {2020},
}
References¶
Black, Paul E. and Pieterse, Vreda (eds.). 1998. “Directed acyclic word graph”, Dictionary of Algorithms and Data Structures. Gaithersburg: National Institute of Standards and Technology.
Blumer, Anselm C.; Blumer, Janet A.; Haussler, David; Ehrenfeucht, Andrzej; Chen, M.T.; Seiferas, Joel I. 1985. “The smallest automaton recognizing the subwords of a text”, Theoretical Computer Science, 40: 31–55. doi:10.1016/0304-3975(85)90157-4.
Ciura, Marcin G. and Deorowicz, Sebastian. 2002. “How to sequeeze a lexicon”, Software-Practice and Experience 31(11):1077-1090.
Daciuk, Jan; Mihov, Stoyan; Watson, Bruce and Watson, Richard. 2000. “Incremental construction of minimal acyclic finite state automata.” Computational Linguistics 26(1):3-16.
Havon, Steve. 2011. “Compressing dictionaries with a DAWG”. Steve Hanov’s Blog. url
Lucchesi, Cláudio L. and Kowaltowski, Tomasz. “Applications of finite automata representing large vocabularies”. Software-Practice and Experience. 1993: 15–30. CiteSeerX 10.1.1.56.5272.