dafsa package

Submodules

dafsa.dafsa module

Main module for computing DAFSA/DAWG graphs from list of strings.

The library computes a Deterministic Acyclic Finite State Automata from a list of sequences in a non incremental way, with no plans to expand to incremental computation. The library was originally based on public domain code by Steve Hanov (2011).

class dafsa.dafsa.DAFSA(sequences, **kwargs)

Bases: object

Class representing a DAFSA object.

Parameters:
  • sequences (list) – List of sequences to be added to the DAFSA object.
  • weight (bool) – Whether to collect edge weights after minimization. Defaults to True.
  • condense (bool) – Whether to join sequences of transitions into single compound transitions whenever possible. Defaults to False.
  • delimiter (str) – The delimiter to use in case of joining single path transitions. Defaults to a single white space (” “).
  • minimize (bool) – Whether to minimize the trie into a DAFSA. Defaults to True; this option is implemented for development and testing purposes and it is not intended for users (there are specific and better libraries and algorithms to build tries).
condense()

Condenses the automaton, merging single-child nodes with their parents.

The function joins paths of unique edges into single edges with compound transitions, removing redundant nodes. A redundant node is defined as one that (a) is not final, (b) emits a single transition, (b) receives a single transition, and (d) its source emits a single transition.

Internally, the function will call the ._joining_round() method until no more candidates for joining are available. Performing everything in a single step would require a more complex logic.

count_edges()

Return the number of minimized edges in the structure.

Returns:edge_count – Number of minimized edges in the structure.
Return type:int
count_nodes()

Return the number of minimized nodes in the structure.

Returns:node_count – Number of minimized nodes in the structure.
Return type:int
count_sequences()

Return the number of sequences inserted in the structure.

Please note that the return value mirrors the number of sequences provided during initialization, and not a set of it: repeated sequences are accounted, as each will be added a single time to the object.

Returns:seq_count – Number of sequences in the structure.
Return type:int
lookup(sequence)

Check if a sequence can be expressed by the DAFSA.

The method does not return all possible potential paths, nor the cumulative weight: if this is needed, the DAFSA object should be converted to a Graph and other libraries, such as networkx, should be used.

Parameters:sequence (sequence) – Sequence to be checked for presence/absence.
Returns:node – Either a tuple with a DAFSANode referring to the final state that can be reached by following the specified sequence, plus the cumulative weight for reaching it, or None if no path can be found.
Return type:tuple of DAFSANode and int, or None
to_dot(**kwargs)

Return a representation of the DAFSA as a .dot (Graphviz) source code.

Parameters:
  • label_nodes (bool) – A boolean flag indicating whether or not to label nodes with their respective node ids (default: False).
  • weight_scale (float) – A floating point value indicating how much edges in the graph should be scaled in relation to their frequency (default: 1.5).
Returns:

source – The .DOT source code representing the DAFSA.

Return type:

str

to_graph()

Generate a networkx directed weighted graph representing the DAFSA.

Returns:graph – A networkx Graph representing the current object.
Return type:networkx.Graph
write_figure(output_file, dpi=300, **kwargs)

Generate a figure visualization through a graphviz call.

The filetype will be decided automatically from the filename extension. Please note that this method uses the subprocess library and will invoke the local dot library.

Parameters:
  • output_file (str) – The path to the output file.
  • dpi (int) – The output resolution, if applicable. Defaults to 300.
  • label_nodes (bool) – A boolean flag indicating whether or not to label nodes with their respective node ids (default: False).
  • weight_scale (float) – A floating point value indicating how much edges in the graph should be scaled in relation to their frequency (default: 1.5).
write_gml(filename)

Write the DAFSA in GML format to the file filename.

Parameters:filename (str) – The filename to write. Files whose names end with .gz or .bz2 will be compressed.
class dafsa.dafsa.DAFSAEdge(node, weight=0)

Bases: dict

Class representing edge objects in a DAFSA.

This class overloads a normal Python dictionary, and in simpler implementations could potentially be replaced with a pure dictionary. It was implemented as its own object for homogeneity and for planned future expansions, particularly in terms of fuzzy automata.

Parameters:
  • node (DAFSANode) – Reference to the target node, mandatory. Please note that it must be a DAFSANode object and not a node id.
  • weight (int) – Edge weight as collected from training data. Defaults to 0.
repr_hash()

Return a hash for the edge.

The returned has is based on the unambigous string representation provided by the .__repr__() method, allowing to use edges as, among others, dictionary keys. The method is complemented by the .__hash__() one.

Returns:hash – The hash from the unambigous textual representation of the current edge.
Return type:number
class dafsa.dafsa.DAFSANode(node_id)

Bases: object

Class representing node objects in a DAFSA.

Each object carries an internal node_id integer identifier which must be locally unique within a DAFSA, but is meaningless. There is no implicit order nor a sequential progression must be observed.

As in previous implementation by Hanov (2011), minimization is performed by comparing nodes, with equivalence determined by the standard Python .__eq__() method which overloads the equality operator. Nodes are considered identical if they have identical edges, which edges pointing from or to the same node. In particular, edge weight and node finalness, respectively expressed by the .weight and .final properties, are not considered. This allows to correctly count edges after minimization and to have final pass-through nodes.

Parameters:node_id (int) – The global unique ID for the current node.
repr_hash()

Return a hash for the node.

The returned has is based on the unambigous string representation provided by the .__repr__() method, allowing to use nodes as, among others, dictionary keys. The method is complemented by the .__hash__() one.

Returns:hash – The hash from the unambigous textual representation of the current node.
Return type:number

dafsa.output module

Module holding output functions.

Please note that this module is excluded from automatic security analysis, as it must perform system calls using subprocess.

dafsa.output.graphviz_output(dot_source, output_file, dpi=300)

Generates a visualization by calling the local graphviz.

The filetype will be decided from the extension of the filename.

Parameters:
  • output_file (str) – The path to the output file.
  • dpi (int) – The output resolution. Defaults to 300.
Returns:

ret – A CompleteProcess instance, as returned by the subprocess call.

Return type:

subprocess.CompletedProcess

dafsa.utils module

Module holding a number of utility functions.

The module also exports a RESOURCE_DIR object of type pathlib.Path, pointing to the local resources directory.

dafsa.utils.common_prefix_length(seq_a, seq_b)

Return the length of the common prefix between two sequences.

Parameters:
  • seq_a (iter) – An iterable holding the first sequence.
  • seq_b (iter) – An iterable holding the second sequence.
Returns:

length – The length of the common prefix between seq_a and seq_b.

Return type:

int

Examples

>>> import dafsa
>>> dafsa.utils.common_prefix_length("abcde", "abcDE")
3
>>> dafsa.utils.common_prefix_length("abcde", "ABCDE")
0
dafsa.utils.pairwise(iterable)

Iterate pairwise over an iterable.

The function follows the recipe offered on Python’s itertools documentation.

Parameters:iterable (iter) – The iterable to be iterate pairwise.

Examples

>>> import dafsa
>>> list(dafsa.utils.pairwise([1,2,3,4,5]))
[(1, 2), (2, 3), (3, 4), (4, 5)]

Module contents

__init__ file for the DAFSA library.

This single file allows to load the library and the main object easily, as in:

>>> from dafsa import DAFSA
>>> d = DAFSA(["tap", "taps", "top", "tops", "dibs"])
>>> print(d)
DAFSA with 8 nodes and 9 edges (5 inserted sequences)
  +-- #0: 0(#1/5:<d>/1|#5/5:<t>/4) [('t', 5), ('d', 1)]
  +-- #1: n(#2/1:<i>/1) [('i', 2)]
  +-- #2: n(#3/1:<b>/1) [('b', 3)]
  +-- #3: n(#4/1:<s>/1) [('s', 4)]
  +-- #4: F() []
  +-- #5: n(#6/4:<a>/2|#6/4:<o>/2) [('o', 6), ('a', 6)]
  +-- #6: n(#7/4:<p>/4) [('p', 7)]
  +-- #7: F(#4/4:<s>/2) [('s', 4)]