Welcome to dafsa’s documentation!

PyPI Travis codecov Codacy Documentation Status Zenodo JOSS

dafsa is a Python library for computing Deterministic Acyclic Finite State Automata (also known as “directed acyclic word graphs”, or DAWG) for purposes of data exploration and visualization. DAFSAs are data structures derived from tries that allow to represent a set of sequences (typically character strings or n-grams) in the form of a directed acyclic graph with a single source vertex (the start symbol shared by all sequences) and at least one sink edge (final symbols, each pointed to by one or more sequences), such as in the following image.

Example of a DAFSA graph

Example of a DAFSA graph from a selection of segmented Italian words transcribed in the International Phonetic Alphabet

DAFSA Library

DAFSA is a library for computing Deterministic Acyclic Finite State Automata (also known as “directed acyclic word graphs”, or DAWG). DAFSA are data structures derived from tries that allow to represent a set of sequences (typically character strings or n-grams) in the form of a directed acyclic graph with a single source vertex (the start symbol of all sequences) and at least one sink edge (final symbols, each pointed to by one or more sequences). In the current implementation, a trait of each node expresses whether it can be used a sink.

The primary difference between DAFSA and tries is that the latter eliminates suffix and infix redundancy, as in the example of Figure 1 (from the linked Wikipedia article) storing the set of strings "tap", "taps", "top", and "tops". Even though DAFSAs cannot be used to store precise frequency information, given that multiple paths can reach the same terminal node, they still allow to estimate the sampling frequency; being acyclic, they can also reject any sequence not included in the training. Fuzzy extensions will allow to estimate the sampling probability of unobserved sequences.

Trie vs. DAFSA

Trie vs. DAFSA

This data structure is a special case of a finite state recognizer that acts as a deterministic finite state automaton, as it recognizes all and tonly the sequences it was built upon. It is frequently used in computer science for the space-efficient storage of sets of sequences without common compression techniques, such as dictionary or entropy types, or without probabilistic data structures, such as Bloom filters.

The automata generated by this library are intended for usage by linguists in studies on morphology and formal grammars, allowing for faster, easier, and simpler generation of visualizations. The library also extends published models by allowing to approximate probability of random observation by carrying information on the weight of each graph edge.

Installation

In any standard Python environment, dafsa can be installed with:

pip install dafsa

A conda package is also available, and can be installed from the tresoldi channel with:

conda install -c tresoldi dafsa

Alternatives

The main alternative to this library is the dawg one, available at https://github.com/pytries/DAWG. dawg wraps the dwagdic C++ library, and is intended to production usage of DAFSAs as a space-efficient data structure. It does not support the computation of edge weights, nor it is intended for exporting the internal structure as a graph.

Other alternatives are the adfa and minim packages, written in C/C++, written by Jan Daciuk. The personal webpage hosting them has been offline for years, with a version at the Wayback Machine available. Note that the archived version does not include the packages, which are available in our GitHub repository in the /daciuk folder (Jan Daciuk is in no way involved with this library, and his code, originally released under the GPL2 license, is copied for archival purposes).

Author

The library is developed by Tiago Tresoldi (tresoldi@shh.mpg.de).

The author has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No. ERC Grant #715618, Computer-Assisted Language Comparison.

How to cite

If you use dafsa, please cite it as:

Tresoldi, Tiago (2020). DAFSA, a a library for computing Deterministic Acyclic Finite State Automata. Version 1.0. Jena. DOI: 10.5281/zenodo.3668870

In BibTeX:

@misc{Tresoldi2020dafsa,
  author = {Tresoldi, Tiago},
  title = {DAFSA, a a library for computing Deterministic Acyclic Finite State Automata. Version 1.0},
  howpublished = {\url{https://github.com/tresoldi/dafsa}},
  address = {Jena},
  doi = {10.5281/zenodo.3668870},
  year = {2020},
}

References

Black, Paul E. and Pieterse, Vreda (eds.). 1998. “Directed acyclic word graph”, Dictionary of Algorithms and Data Structures. Gaithersburg: National Institute of Standards and Technology.

Blumer, Anselm C.; Blumer, Janet A.; Haussler, David; Ehrenfeucht, Andrzej; Chen, M.T.; Seiferas, Joel I. 1985. “The smallest automaton recognizing the subwords of a text”, Theoretical Computer Science, 40: 31–55. doi:10.1016/0304-3975(85)90157-4.

Ciura, Marcin G. and Deorowicz, Sebastian. 2002. “How to sequeeze a lexicon”, Software-Practice and Experience 31(11):1077-1090.

Daciuk, Jan; Mihov, Stoyan; Watson, Bruce and Watson, Richard. 2000. “Incremental construction of minimal acyclic finite state automata.” Computational Linguistics 26(1):3-16.

Havon, Steve. 2011. “Compressing dictionaries with a DAWG”. Steve Hanov’s Blog. url

Lucchesi, Cláudio L. and Kowaltowski, Tomasz. “Applications of finite automata representing large vocabularies”. Software-Practice and Experience. 1993: 15–30. CiteSeerX 10.1.1.56.5272.

How to use

The library offers a DAFSA object to compute automata, along with methods for checking a sequence acceptance and for exporting the graph. A minimal usage is shown here:

>>> from dafsa import DAFSA
>>> d = DAFSA(["tap", "taps", "top", "tops", "dibs"])
>>> print(d)
DAFSA with 8 nodes and 9 edges (5 inserted sequences)
  +-- #0: 0(#1/5:<d>/1|#5/5:<t>/4) [('d', 1), ('t', 5)]
  +-- #1: n(#2/1:<i>/1) [('i', 2)]
  +-- #2: n(#3/1:<b>/1) [('b', 3)]
  +-- #3: n(#4/1:<s>/1) [('s', 4)]
  +-- #4: F() []
  +-- #5: n(#6/4:<a>/2|#6/4:<o>/2) [('o', 6), ('a', 6)]
  +-- #6: n(#7/4:<p>/4) [('p', 7)]
  +-- #7: F(#4/4:<s>/2) [('s', 4)]

Note how the resulting graph includes the 5 training sequences, with one start node (#0) that advances with either a t (observed four times) or a d symbol (observed a single time), a subsequent node to t that only advances with a and o symbols (#1), and so on. More information on how to interpret the structure of a DAFSA are given in the following subsections.

The structure is much clearer with a graphical representation:

>>> d.write_figure("example.png")
First example

A DAFSA object allows to check for the presence or absence of a sequence in its structure, returning a terminal node if it can find a path. If the structure was computed using frequency weights, as by default, the cumulative path weight is also returned.

>>> d.lookup("tap")
(F(#4/4:<s>/2), 10)
>>> d.lookup("tops")
(F(), 12)
>>> d.lookup("tapap") is None
True
>>> d.lookup("ta") is None
True

Besides a number of alternatives for exporting visualizations, the structures can also be exported as networkx graphs:

>>> d.to_graph()
<networkx.classes.graph.Graph object at 0x7f51d15b7ac8>

The library includes a command-line tool for reading files with lists of sequences, with one sequence per line:

$ cat resources/dna.txt
CGCGAAA
CGCGATA
CGGAAA
CGGATA
GGATA
AATA

$ dafsa resources/dna.txt -t png -o dna.png

Which will produce the following graph:

DNA example

Sequences are by default processed one character at time, with each character assumed to be a single token. Pre-tokenized data can be provided to the library in the format of a Python list or tuple, or specified in source by using spaces as token delimiters:

$ cat resources/phonemes.txt
o kː j o
o r e kː j o
n a z o
s e n t i r e
s e n s o
ɡ u a r d a r e
a m a r e
v o l a r e

$ dafsa resources/phonemes.txt -t png -o phonemes.png
Phoneme example

DAFSA structures can be exported in PDF, SVG, GLM, and DOT formats.

Walkthrough example

The simplest DAFSAs are capable of expressing only one string, linking its sequence of characters from the first to the last. The creation of such a DAFSA follows the procedures described above, here demonstrated with the string tap:

>>> from dafsa import DAFSA
>>> d1 = DAFSA(["tap"])
>>> print(d1)
DAFSA with 4 nodes and 3 edges (1 inserted sequences)
  +-- #0: 0(#1/1:<t>/1) [('t', 1)]
  +-- #1: n(#2/1:<a>/1) [('a', 2)]
  +-- #2: n(#3/1:<p>/1) [('p', 3)]
  +-- #3: F() []

The printed textual representation confirms that only one sequence was inserted, with three edges (or “transitions”, corresponding to the three characters in our string) among four nodes (that is, a start node, two transitional nodes, and a final node). A unique but meaningless index (in this basic example, the numbers from zero to three) identifies each node, an explicit textual representation (such as #1/1:<t>/1) that informs the type of node (0 for the start node, n for “normal” nodes which are neither starts nor ends, and F for final nodes), and a list of the edges originating in each node (designed as a human-friendly, or at least programmer-friendly, rendition of the textual representation).

In this first example, we have:

  • A node of index 0 (#0) and type 0 (that is, a start symbol), which, as indicated by the list [('t', 1)] can only transition (“advance”) with the character t, moving to the node of index 1.
  • A node of index 1 and type n (that is, a “normal” node), which can transition with the character a towards the node of index 2.
  • A similar node of index 2 and type n as above, which can only transition with the character p towards node 3.
  • A node of index 3 and type F (that is, a final one), which by definition ends a sequence.

We can easily generate a graphical representation:

>>> d1.write_figure("d1.png")
Walkthrough first DAFSA

We can experiment with expanding the collection of sequences covered by the DAFSA. If we include a second string dibs, without overlapping substrings with tap, we can observe how they only share the start and end symbols:

>>> d2 = DAFSA(["tap", "dibs"])
>>> print(d2)
DAFSA with 7 nodes and 7 edges (2 inserted sequences)
  +-- #0: 0(#1/2:<d>/1|#5/2:<t>/1) [('d', 1), ('t', 5)]
  +-- #1: n(#2/1:<i>/1) [('i', 2)]
  +-- #2: n(#3/1:<b>/1) [('b', 3)]
  +-- #3: n(#4/1:<s>/1) [('s', 4)]
  +-- #4: F() []
  +-- #5: n(#6/1:<a>/1) [('a', 6)]
  +-- #6: n(#4/1:<p>/1) [('p', 4)]

Here the start symbol (node of index zero) carries two transition possibilities: either the character d, which would move towards node #1, or the character t, which would move towards node #5. As the two strings share no information, each of these initial transitions would lead to independent paths with no inner alternatives (from d to i, b, and s, in the first case, and from t to a and p, in the second). The structure only merges at the final symbol of index #4, which can be reached both from node #3 (with a transition s) or from node #6 (with a transition p).

The same information is expressed visually:

>>> d2.write_figure("d2.png")
Walkthrough second DAFSA

The convenience of DAFSAs becomes clearer if we extend this collection of strings with elements that have overlapping material. For example, by introducing the string taps, which overlaps with tap at the beginning and with dibs at the end, we both trigger the automaton minimization, highlighting overlapping regions, and collect frequencies for the number of observed nodes and transitions.

>>> d3 = DAFSA(["tap", "dibs", "taps"])
>>> print(d3)
DAFSA with 8 nodes and 8 edges (3 inserted sequences)
  +-- #0: 0(#1/3:<d>/1|#5/3:<t>/2) [('t', 5), ('d', 1)]
  +-- #1: n(#2/1:<i>/1) [('i', 2)]
  +-- #2: n(#3/1:<b>/1) [('b', 3)]
  +-- #3: n(#4/1:<s>/1) [('s', 4)]
  +-- #4: F() []
  +-- #5: n(#6/2:<a>/2) [('a', 6)]
  +-- #6: n(#7/2:<p>/2) [('p', 7)]
  +-- #7: F(#4/2:<s>/1) [('s', 4)]

The start node #0 has only two transitions, as in the previous structure, because there are only two initial characters, t and d. However, frequency information (the number to the right of the slash in the unambiguous representation) informs that t is twice as frequent. It is worth noting that we now have two final nodes, #4 and #7, the latter of which is a “pass-through” final node, meaning that a path can end at this node, treating it as a final one, or advance to one of its transitions (here, an s to node #4), using it as a normal node. Here, the pass-through is due to the pair of strings tap/taps, as, once we reach node #7 with a transition p from node #6, we can either terminate the string or add an s.

Node #4 might be at first surprising, as both dibs and taps and with an s and one could expect them to share this suffix. A careful examination shows that this is not possible, as the final node #7 of taps is a pass-through: if dibs were to share such node, the DAFSA would entail that a dib string is a member of this set, which is false. Such a structure tends to be better visualized with a graphical representation:

>>> d3.write_figure("d3.png")
Walkthrough third DAFSA

One perhaps counter-intuitive but welcomed property of DAFSAs, illustrated by an expansion of the above example, is that enlarging the group of strings can reduce the automaton for their representation. If we add the “missing” string dib to our collection, the resulting DAFSA is:

>>> d4 = DAFSA(["tap", "dibs", "taps", "dib"])
>>> print(d4)
DAFSA with 7 nodes and 7 edges (4 inserted sequences)
  +-- #0: 0(#1/4:<d>/2|#5/4:<t>/2) [('t', 5), ('d', 1)]
  +-- #1: n(#2/2:<i>/2) [('i', 2)]
  +-- #2: n(#3/2:<b>/2) [('b', 3)]
  +-- #3: F(#4/4:<s>/2) [('s', 4)]
  +-- #4: F() []
  +-- #5: n(#6/2:<a>/2) [('a', 6)]
  +-- #6: n(#3/2:<p>/2) [('p', 3)]
Walkthrough fourth DAFSA

Usage example for scientific analysis

We have learned how to use DAFSA and explored how to understand its output. Users unfamiliar with the library might at this point find it an intersting and handy tool without having a precise idea of its purpose, also considering how its documentation repeats that, while this kind of automata is often used for space-efficient data storage, the library focuses on the study of common patterns in linguistic structures.

It is worth examining two examples of English “morphology”, related to the production of regular and irregular plural nouns. Please notice how morphology is here between quotes: even though the output of this library has been used for real morphological studies, chiefly as a prior hypothesis to the segmentation of words into their constituents, we are not referring to morphemes in the proper grammatical sense of the study of word structures in terms of processes such as inflection or derivation. In the context of this documentation, “morpheme” is equivalent to a substring or set of substrings.

Let’s begin by creating a DAFSA involving the singular and plural forms of two regular English nouns, “cat” and “dog”. As hoped, we obtain a design with a final pass-through node (here, of index #3), showing us that the paths can either end at this node (for singular forms) or transition with an extra character s to the other final node (index #4), producing the plural form.

DAFSA with 3 nodes and 3 edges (4 inserted sequences)
  +-- #0: 0(#3/4:<c a t>/2|#3/4:<d o g>/2) [('d o g', 3), ('c a t', 3)]
  +-- #3: F(#4/4:<s>/2) [('s', 4)]
  +-- #4: F() []
First real example

If we include another set of regular nouns (“cow” and “cows”), we notice how the global topography doesn’t change much, as again the paths can end at node #3 or advance with a character s to node #4. Notice how the library could further reduce the structure by identifying a shared initial c between “cat” and “cow”: from linguistic studies, including cognates, we know they are not the same morpheme in reality (they are not even a morpheme), but they are shared material that should be investigated if we knew nothing about the language under study.

DAFSA with 4 nodes and 5 edges (6 inserted sequences)
 +-- #0: 0(#1/6:<c>/4|#3/6:<d o g>/2) [('c', 1), ('d o g', 3)]
 +-- #1: n(#3/4:<a t>/2|#3/4:<o w>/2) [('a t', 3), ('o w', 3)]
 +-- #3: F(#4/6:<s>/3) [('s', 4)]
 +-- #4: F() []
Second real example

Adding further pairs of regular nouns will increase the complexity of the DAFSA, as they share less and less material, but would likewise highlight recurrent elements that are potential constituents. The library informs such frequencies in the textual representation, but the information is better visualized in the graphical output, with the size of the nodes and the width of the edges showing how many times we visit each one. Here, the c shared by “cat” and “cow” is less significant than the final s.

DAFSA with 5 nodes and 8 edges (10 inserted sequences)
  +-- #0: 0(#1/10:<c>/4|#3/10:<d o g>/2|#17/10:<g i r a f f>/2|#17/10:<t u r t l>/2) [('t u r t l', 17), ('d o g', 3), ('c', 1), ('g i r a f f', 17)]
  +-- #1: n(#3/4:<a t>/2|#3/4:<o w>/2) [('a t', 3), ('o w', 3)]
  +-- #3: F(#4/10:<s>/5) [('s', 4)]
  +-- #4: F() []
  +-- #17: n(#3/4:<e>/4) [('e', 3)]
Third real example

Including two pairs of irregular plurals, “goose/geese” and “ox/oxen”, will highlight the differences with the previously added words, as the final pass-through node for the plural with s, the one of index #3, is not reachable in these cases.

DAFSA with 8 nodes and 14 edges (14 inserted sequences)
  +-- #0: 0(#1/14:<c>/4|#3/14:<d o g>/2|#12/14:<g>/4|#29/14:<o x>/2|#21/14:<t u r t l>/2) [('o x', 29), ('d o g', 3), ('c', 1), ('g', 12), ('t u r t l', 21)]
  +-- #1: n(#3/4:<a t>/2|#3/4:<o w>/2) [('a t', 3), ('o w', 3)]
  +-- #3: F(#4/10:<s>/5) [('s', 4)]
  +-- #4: F() []
  +-- #12: n(#14/4:<e e>/1|#21/4:<i r a f f>/2|#14/4:<o o>/1) [('i r a f f', 21), ('e e', 14), ('o o', 14)]
  +-- #14: n(#4/2:<s e>/2) [('s e', 4)]
  +-- #21: n(#3/4:<e>/4) [('e', 3)]
  +-- #29: F(#4/2:<e n>/1) [('e n', 4)]
Fourth real example

As a final development for this example, we can add pairs which are not strictly regular, but which display a common pattern, such as “man/men” and “woman/women”. The automata can identify this structure, but notice how the graph also joins the final n of “man” and “woman” to the final “n” of “oxen”. We know from the history of the English language that these are likewise unrelated, but once more in the case of an unfamiliar language we should examine this pattern as a “candidate morpheme”.

DAFSA with 10 nodes and 19 edges (18 inserted sequences)
  +-- #0: 0(#1/18:<c>/4|#3/18:<d o g>/2|#12/18:<g>/4|#28/18:<m>/2|#34/18:<o x>/2|#21/18:<t u r t l>/2|#28/18:<w o m>/2) [('m', 28), ('w o m', 28), ('d o g', 3), ('o x', 34), ('c', 1), ('g', 12), ('t u r t l', 21)]
  +-- #1: n(#3/4:<a t>/2|#3/4:<o w>/2) [('a t', 3), ('o w', 3)]
  +-- #3: F(#4/10:<s>/5) [('s', 4)]
  +-- #4: F() []
  +-- #12: n(#14/4:<e e>/1|#21/4:<i r a f f>/2|#14/4:<o o>/1) [('i r a f f', 21), ('e e', 14), ('o o', 14)]
  +-- #14: n(#4/2:<s e>/2) [('s e', 4)]
  +-- #21: n(#3/4:<e>/4) [('e', 3)]
  +-- #28: n(#29/4:<a>/2|#29/4:<e>/2) [('a', 29), ('e', 29)]
  +-- #29: n(#4/5:<n>/5) [('n', 4)]
  +-- #34: F(#29/2:<e>/1) [('e', 29)]
Fifth real example

A DAFSA can highlight various regular processes concurrently. If we decide to investigate different patterns for producing plurals, using words ending with a sibilant consonant (as “dish” and “witch”), words ending in o (as “hero” and “potato”), and words ending in y (as “lady” and “sky”), besides some simple cases of the previous example (“cat” and “dog”), the processes of plural formation in English orthography are exposed.

DAFSA with 9 nodes and 16 edges (16 inserted sequences)
  +-- #0: 0(#3/16:<c a t>/2|#5/16:<d>/4|#16/16:<h e r>/2|#22/16:<l a d>/2|#16/16:<p o t a t>/2|#22/16:<s k>/2|#7/16:<w i t c>/2) [('l a d', 22), ('h e r', 16), ('s k', 22), ('d', 5), ('w i t c', 7), ('p o t a t', 16), ('c a t', 3)]
  +-- #3: F(#4/4:<s>/2) [('s', 4)]
  +-- #4: F() []
  +-- #5: n(#7/4:<i s>/2|#3/4:<o g>/2) [('o g', 3), ('i s', 7)]
  +-- #7: n(#8/4:<h>/4) [('h', 8)]
  +-- #8: F(#9/8:<e>/4) [('e', 9)]
  +-- #9: n(#4/6:<s>/6) [('s', 4)]
  +-- #16: n(#8/4:<o>/4) [('o', 8)]
  +-- #22: n(#9/4:<i e>/2|#4/4:<y>/2) [('y', 4), ('i e', 9)]
Sixth real example

Manual inspection and study can quickly become intractable, notably in real studies considering phonological processes and when the strings also carry annotation tags (such as labels informing if each form is a singular or a plural). This is expected, because even though the output is designed to be human-readable and explorable, the automaton still works akin to a lossless compressor, not unlike grammar-based codes and the smallest grammar problem (despite operating on lists of sequences and not on single strings). For this reason, the library allows to export DAFSAs as graphs, allowing its output to be used by other software and for other purposes.

dafsa

dafsa package

Submodules

dafsa.dafsa module

Main module for computing DAFSA/DAWG graphs from list of strings.

The library computes a Deterministic Acyclic Finite State Automata from a list of sequences in a non incremental way, with no plans to expand to incremental computation. The library was originally based on public domain code by Steve Hanov (2011).

class dafsa.dafsa.DAFSA(sequences, **kwargs)

Bases: object

Class representing a DAFSA object.

Parameters:
  • sequences (list) – List of sequences to be added to the DAFSA object.
  • weight (bool) – Whether to collect edge weights after minimization. Defaults to True.
  • condense (bool) – Whether to join sequences of transitions into single compound transitions whenever possible. Defaults to False.
  • delimiter (str) – The delimiter to use in case of joining single path transitions. Defaults to a single white space (” “).
  • minimize (bool) – Whether to minimize the trie into a DAFSA. Defaults to True; this option is implemented for development and testing purposes and it is not intended for users (there are specific and better libraries and algorithms to build tries).
condense()

Condenses the automaton, merging single-child nodes with their parents.

The function joins paths of unique edges into single edges with compound transitions, removing redundant nodes. A redundant node is defined as one that (a) is not final, (b) emits a single transition, (b) receives a single transition, and (d) its source emits a single transition.

Internally, the function will call the ._joining_round() method until no more candidates for joining are available. Performing everything in a single step would require a more complex logic.

count_edges()

Return the number of minimized edges in the structure.

Returns:edge_count – Number of minimized edges in the structure.
Return type:int
count_nodes()

Return the number of minimized nodes in the structure.

Returns:node_count – Number of minimized nodes in the structure.
Return type:int
count_sequences()

Return the number of sequences inserted in the structure.

Please note that the return value mirrors the number of sequences provided during initialization, and not a set of it: repeated sequences are accounted, as each will be added a single time to the object.

Returns:seq_count – Number of sequences in the structure.
Return type:int
lookup(sequence)

Check if a sequence can be expressed by the DAFSA.

The method does not return all possible potential paths, nor the cumulative weight: if this is needed, the DAFSA object should be converted to a Graph and other libraries, such as networkx, should be used.

Parameters:sequence (sequence) – Sequence to be checked for presence/absence.
Returns:node – Either a tuple with a DAFSANode referring to the final state that can be reached by following the specified sequence, plus the cumulative weight for reaching it, or None if no path can be found.
Return type:tuple of DAFSANode and int, or None
to_dot(**kwargs)

Return a representation of the DAFSA as a .dot (Graphviz) source code.

Parameters:
  • label_nodes (bool) – A boolean flag indicating whether or not to label nodes with their respective node ids (default: False).
  • weight_scale (float) – A floating point value indicating how much edges in the graph should be scaled in relation to their frequency (default: 1.5).
Returns:

source – The .DOT source code representing the DAFSA.

Return type:

str

to_graph()

Generate a networkx directed weighted graph representing the DAFSA.

Returns:graph – A networkx Graph representing the current object.
Return type:networkx.Graph
write_figure(output_file, dpi=300, **kwargs)

Generate a figure visualization through a graphviz call.

The filetype will be decided automatically from the filename extension. Please note that this method uses the subprocess library and will invoke the local dot library.

Parameters:
  • output_file (str) – The path to the output file.
  • dpi (int) – The output resolution, if applicable. Defaults to 300.
  • label_nodes (bool) – A boolean flag indicating whether or not to label nodes with their respective node ids (default: False).
  • weight_scale (float) – A floating point value indicating how much edges in the graph should be scaled in relation to their frequency (default: 1.5).
write_gml(filename)

Write the DAFSA in GML format to the file filename.

Parameters:filename (str) – The filename to write. Files whose names end with .gz or .bz2 will be compressed.
class dafsa.dafsa.DAFSAEdge(node, weight=0)

Bases: dict

Class representing edge objects in a DAFSA.

This class overloads a normal Python dictionary, and in simpler implementations could potentially be replaced with a pure dictionary. It was implemented as its own object for homogeneity and for planned future expansions, particularly in terms of fuzzy automata.

Parameters:
  • node (DAFSANode) – Reference to the target node, mandatory. Please note that it must be a DAFSANode object and not a node id.
  • weight (int) – Edge weight as collected from training data. Defaults to 0.
repr_hash()

Return a hash for the edge.

The returned has is based on the unambigous string representation provided by the .__repr__() method, allowing to use edges as, among others, dictionary keys. The method is complemented by the .__hash__() one.

Returns:hash – The hash from the unambigous textual representation of the current edge.
Return type:number
class dafsa.dafsa.DAFSANode(node_id)

Bases: object

Class representing node objects in a DAFSA.

Each object carries an internal node_id integer identifier which must be locally unique within a DAFSA, but is meaningless. There is no implicit order nor a sequential progression must be observed.

As in previous implementation by Hanov (2011), minimization is performed by comparing nodes, with equivalence determined by the standard Python .__eq__() method which overloads the equality operator. Nodes are considered identical if they have identical edges, which edges pointing from or to the same node. In particular, edge weight and node finalness, respectively expressed by the .weight and .final properties, are not considered. This allows to correctly count edges after minimization and to have final pass-through nodes.

Parameters:node_id (int) – The global unique ID for the current node.
repr_hash()

Return a hash for the node.

The returned has is based on the unambigous string representation provided by the .__repr__() method, allowing to use nodes as, among others, dictionary keys. The method is complemented by the .__hash__() one.

Returns:hash – The hash from the unambigous textual representation of the current node.
Return type:number

dafsa.output module

Module holding output functions.

Please note that this module is excluded from automatic security analysis, as it must perform system calls using subprocess.

dafsa.output.graphviz_output(dot_source, output_file, dpi=300)

Generates a visualization by calling the local graphviz.

The filetype will be decided from the extension of the filename.

Parameters:
  • output_file (str) – The path to the output file.
  • dpi (int) – The output resolution. Defaults to 300.
Returns:

ret – A CompleteProcess instance, as returned by the subprocess call.

Return type:

subprocess.CompletedProcess

dafsa.utils module

Module holding a number of utility functions.

The module also exports a RESOURCE_DIR object of type pathlib.Path, pointing to the local resources directory.

dafsa.utils.common_prefix_length(seq_a, seq_b)

Return the length of the common prefix between two sequences.

Parameters:
  • seq_a (iter) – An iterable holding the first sequence.
  • seq_b (iter) – An iterable holding the second sequence.
Returns:

length – The length of the common prefix between seq_a and seq_b.

Return type:

int

Examples

>>> import dafsa
>>> dafsa.utils.common_prefix_length("abcde", "abcDE")
3
>>> dafsa.utils.common_prefix_length("abcde", "ABCDE")
0
dafsa.utils.pairwise(iterable)

Iterate pairwise over an iterable.

The function follows the recipe offered on Python’s itertools documentation.

Parameters:iterable (iter) – The iterable to be iterate pairwise.

Examples

>>> import dafsa
>>> list(dafsa.utils.pairwise([1,2,3,4,5]))
[(1, 2), (2, 3), (3, 4), (4, 5)]

Module contents

__init__ file for the DAFSA library.

This single file allows to load the library and the main object easily, as in:

>>> from dafsa import DAFSA
>>> d = DAFSA(["tap", "taps", "top", "tops", "dibs"])
>>> print(d)
DAFSA with 8 nodes and 9 edges (5 inserted sequences)
  +-- #0: 0(#1/5:<d>/1|#5/5:<t>/4) [('t', 5), ('d', 1)]
  +-- #1: n(#2/1:<i>/1) [('i', 2)]
  +-- #2: n(#3/1:<b>/1) [('b', 3)]
  +-- #3: n(#4/1:<s>/1) [('s', 4)]
  +-- #4: F() []
  +-- #5: n(#6/4:<a>/2|#6/4:<o>/2) [('o', 6), ('a', 6)]
  +-- #6: n(#7/4:<p>/4) [('p', 7)]
  +-- #7: F(#4/4:<s>/2) [('s', 4)]

Indices and tables