reslib.automate package

Subpackages

Submodules

reslib.automate.code_parser module

reslib.automate.code_parser

This module contains the basic functionality to parse dependencies from code. Currently it uses specially formatted comments to do so, hopefully one day it will automatically extract dependencies.

Assumes files look something like this:

                    ┌────────────────────────┐
INPUT FILES ------> │ This file runs and     │ --> This file (file path)
                    │ creates some output    │
                    │ or writes data to      │
INPUT DATASETS ---> │ disk.                  │ --> OUTPUT DATASETS
                    └────────────────────────┘

Parses comments that look like "# INPUT_FILE:", "# INPUT_DATASET:", or "# OUTPUT_DATASET:" and stores what they reference. Files can be ignored by adding the comment: "RESLIB_IGNORE: True".
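
For illustration, a Python script tagged for this parser might look like the following sketch (file and dataset names are hypothetical):

```python
# code/build_panel.py -- hypothetical example of the comment tags
# INPUT_FILE: code/utilities.py
# INPUT_DATASET: raw/funda.csv
# OUTPUT_DATASET: interim/panel.csv
import pandas as pd

df = pd.read_csv("raw/funda.csv")   # corresponds to the INPUT_DATASET comment above
df.to_csv("interim/panel.csv")      # corresponds to the OUTPUT_DATASET comment above
```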

copyright
  (c) 2019 by Maclean Gaulin.

license

MIT, see LICENSE for more details.

class reslib.automate.code_parser.CodeParser(path_relative=None, path_absolute=None, project_root='.', code_path_prefix=None, data_path_prefix=None, **kwargs)[source]

Bases: object

CodeParser imagines a file as something that takes input, and makes output:

                    ┌────────────────────────┐
INPUT FILES ------> │ This file runs and     │ --> This file (file path)
                    │ creates some output    │
                    │ or writes data to      │
INPUT DATASETS ---> │ disk.                  │ --> OUTPUT DATASETS
                    └────────────────────────┘

path_relative

Path of the code to be analyzed, relative to os.path.join(project_root, code_path_prefix). Defaults to None, which means no file has been scanned.

Type

str, path, None

code_path_prefix

Relative path to the code directory, starting from project_root. Defaults to None, meaning all code paths are relative to project_root.

Type

str, path, None

data_path_prefix

Relative path to the data directory, starting from project_root. Defaults to None, meaning all data paths are relative to project_root.

Type

str, path, None

project_root

Root of the project. If your project has multiple roots, I can’t help you friend.

Type

str, path

input_files

Set of input files scanned from comments in the code.

Type

set

input_datasets

Set of input datasets scanned from comments in the code.

Type

set

output_datasets

Set of output datasets scanned from comments in the code.

Type

set

Private Attributes:

  _language (str): Short name for the language of the CodeParser.

  _extension (str): File extension of the code for this language.

  _file_match_regex (re): Regular expression to match files to be checked by this parser. Default: *._extension.

  _file_encoding (str): Encoding of the file to be opened (passed to open(path, encoding=self._file_encoding)).

  _comment_start (str): The string to search for marking the start of the comment.

  _comment_start_regex (bool): Flag denoting that the _comment_start variable is a regular expression (pre-compiled or string to be compiled).

  _comment_end (str): The string to search for marking the end of the comment.

  _comment_end_regex (bool): Flag denoting that the _comment_end variable is a regular expression (pre-compiled or string to be compiled).

  _ignore_comment_text (str): String denoting the ignore comment.

  _input_file_comment_text (str): String denoting the input file type.

  _input_dataset_comment_text (str): String denoting the input dataset type.

  _output_dataset_comment_text (str): String denoting the output dataset type.

  _ignore_comment_regex (re): Regular expression compiled from the comment start/end attributes and the ignore comment text. Used to signal that a particular file should be ignored.

  _input_file_comment_regex (re): Regular expression compiled from the comment start/end attributes and the comment text. Used to find input file comments in the code.

  _input_dataset_comment_regex (re): Regular expression compiled from the comment start/end attributes and the comment text. Used to find input dataset comments in the code.

  _output_dataset_comment_regex (re): Regular expression compiled from the comment start/end attributes and the comment text. Used to find output dataset comments in the code.
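
As a rough sketch, a new language could in principle be supported by subclassing CodeParser and filling in these attributes; the class and attribute values below are illustrative guesses, not part of reslib:

```python
from reslib.automate.code_parser import CodeParser

class R(CodeParser):
    """Hypothetical parser for R scripts (attribute values are guesses)."""
    _language = "r"
    _extension = ".R"
    _comment_start = "#"    # R line comments start with '#'
    _comment_end = "\n"     # and run to the end of the line
```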

analyze(path_relative=None, path_absolute=None)[source]

Analyze the actual file.

Calls self.analyze_code(file_contents) after reading the file.

Parameters

  • path_relative (str, path, None) – Path to the file to analyze. Defaults to None, in which case the path is taken from self.path_relative.

  • path_absolute (str, path, None) – Absolute path to the file to analyze. Ignored if path_relative is provided. Defaults to None.

Returns

Dictionary of the resulting code object dependencies

Return type

dict

Raises

UnicodeDecodeError – Raised if the file is not encoded according to self._file_encoding (default: utf-8)
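
A minimal sketch of scanning a single file from disk (the project layout and file names are hypothetical):

```python
from reslib.automate.code_parser import SAS

parser = SAS(path_relative="data.sas", project_root="/home/me/projects/example",
             code_path_prefix="code", data_path_prefix="data")
deps = parser.analyze()        # reads the file and parses its dependency comments
print(parser.input_datasets)   # e.g. {"funda.sas7bdat"} if that tag is present
```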

analyze_code(code)[source]

Analyze the text of the file.

Parameters

code (str) – Text of the file to be analyzed.

Returns

Returns True if the code had tags to parse.

Return type

bool
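
analyze_code can also be called directly on a string, which is a quick way to check that a comment tag is recognized (a sketch; the snippet is made up):

```python
from reslib.automate.code_parser import Python

parser = Python()
has_tags = parser.analyze_code(
    "# INPUT_DATASET: raw/funda.csv\n"
    "# OUTPUT_DATASET: interim/panel.csv\n"
)
print(has_tags)                 # True when at least one tag was parsed
print(parser.output_datasets)   # expected to contain 'interim/panel.csv'
```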

check_parent_relationships(potential_parent)[source]

Tests all outputs of the provided potential parent to see whether they match this object’s inputs.

Parameters

potential_parent (CodeParser) – Potential parent code object to test the outputs against self’s inputs.

Returns

List of overlapping files. These will be either the full_path of potential_parent or its output datasets.

Return type

list
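
For example, after two files have been analyzed, a potential parent/child pair can be tested like this (a sketch with hypothetical file names):

```python
from reslib.automate.code_parser import SAS, Stata

parent = SAS(path_relative="data.sas", project_root=".", code_path_prefix="code")
child = Stata(path_relative="load_data.do", project_root=".", code_path_prefix="code")
parent.analyze()
child.analyze()

overlap = child.check_parent_relationships(parent)
if overlap:
    print(f"{parent.path_relative} produces {overlap}, which {child.path_relative} consumes")
```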

code_path_prefix = None

Relative path to the code directory, starting from project_root.

data_path_prefix = None

Relative path to the data directory, starting from project_root.

input_datasets = None

Set of input datasets scanned from comments in the code.

input_files = None

Set of input files scanned from comments in the code.

property is_parsed
classmethod matches(path_relative)[source]
matches_input(file_to_check)[source]

Asks “Is this one of your inputs”, for testing if this is your ‘child’.

Checks the given file against this code’s input files (for a file dependency) and input datasets (for a dataset dependency).

Parameters

file_to_check (str, path) – File to check against this one’s ‘inputs’

Returns

Returns True if file_to_check is in the input files or datasets.

Return type

(bool)

Example

parent = CodeParser(path_relative="a.sas")
child = CodeParser(path_relative="b.sas")

for f in parent.output_datasets:
    match = child.matches_input(f)
    if not match:
        print(f"No relation for {f}")
    else:
        print(f"{child.path_relative} is the {match} we are looking for")

matches_output(file_to_check)[source]

“Are you my mother” test. Returns True-like if the file matches one of the ‘outputs’ of this code. Should be tested against inputs of other code to find a parent dependency.

Checks the input file against this code’s path_relative (for a file dependency) and dataset outputs (for a dataset dependency).

Parameters

file_to_check (str, path) – File (relative path) to check against this one’s ‘outputs’.

Returns

The path to the file or dataset that matches, otherwise None.

Return type

(str, None)

Example

parent = CodeParser(path_relative="a.sas")
child = CodeParser(path_relative="b.sas")

for f in child.input_files | child.input_datasets:
    match = parent.matches_output(f)
    if not match:
        print(f"No relation for {f}")
    else:
        print(f"{parent.path_relative} is the {match} we are looking for")

output_datasets = None

Set of output datasets scanned from comments in the code.

path_absolute = None

Absolute path of the code to be analyzed. Defaults to None, which means no file has been scanned.

path_relative = None

Relative path of the code to be analyzed, relative to os.path.join(project_root, code_path_prefix). Defaults to None, which means no file has been scanned.

project_root = None

Root of the project. If your project has multiple roots, I can’t help you friend.

set_path(path_relative=None, path_absolute=None, project_root='.', code_path_prefix=None, data_path_prefix=None)[source]

Set the file path of the analyzed object, and calculate its position relative to the project root.

Parameters
  • path_relative (str, path, None) – Path of the code to be analyzed, relative to os.path.join(project_root, code_path_prefix). Defaults to None.

  • path_absolute (str, path, None) – Absolute path of the code to be analyzed, from which path_relative is calculated. Ignored if path_relative is provided. Defaults to None.

  • code_path_prefix (str, path, None) – Relative path to the code directory, starting from project_root. Defaults to None, meaning all code paths are relative to project_root.

  • data_path_prefix (str, path, None) – Relative path to the data directory, starting from project_root. Defaults to None, meaning all data paths are relative to project_root.

  • project_root (str, path) – Root of the project. If your project has multiple roots, I can’t help you friend.
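
A sketch of setting the path after construction (paths are hypothetical):

```python
from reslib.automate.code_parser import SAS

parser = SAS()
parser.set_path(path_absolute="/home/me/projects/example/code/data.sas",
                project_root="/home/me/projects/example",
                code_path_prefix="code", data_path_prefix="data")
print(parser.path_relative)   # expected: "data.sas", relative to project_root/code
```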

class reslib.automate.code_parser.CodeParserMetaclass[source]

Bases: type

Compile a CodeParser class to include the regex and parsing functionality.

This allows for SAS.matches instead of SAS().matches
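
In practice this means file matching can be done on the class itself, for example (a sketch):

```python
from reslib.automate.code_parser import SAS, Stata

print(SAS.matches("code/clean_data.sas"))    # True-like: the SAS parser handles .sas files
print(Stata.matches("code/clean_data.sas"))  # False-like: the extension does not match
```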

class reslib.automate.code_parser.Manual(path_relative=None, path_absolute=None, project_root='.', code_path_prefix=None, data_path_prefix=None, **kwargs)[source]

Bases: reslib.automate.code_parser.CodeParser

Manual downloader is used to give instructions for a manual step that isn’t automated in code.

class reslib.automate.code_parser.Notebook(path_relative=None, path_absolute=None, project_root='.', code_path_prefix=None, data_path_prefix=None, **kwargs)[source]

Bases: reslib.automate.code_parser.CodeParser

class reslib.automate.code_parser.Python(path_relative=None, path_absolute=None, project_root='.', code_path_prefix=None, data_path_prefix=None, **kwargs)[source]

Bases: reslib.automate.code_parser.CodeParser

class reslib.automate.code_parser.SAS(path_relative=None, path_absolute=None, project_root='.', code_path_prefix=None, data_path_prefix=None, **kwargs)[source]

Bases: reslib.automate.code_parser.CodeParser

class reslib.automate.code_parser.Stata(path_relative=None, path_absolute=None, project_root='.', code_path_prefix=None, data_path_prefix=None, **kwargs)[source]

Bases: reslib.automate.code_parser.CodeParser

reslib.automate.scanner module

reslib.automate.scanner

This module contains the basic functionality to scan a code-base and extract dependencies from comments.

Assumes files look something like this:

                    ┌────────────────────────┐
INPUT FILES ------> │ This file runs and     │ --> This file (file path)
                    │ creates some output    │
                    │ or writes data to      │
INPUT DATASETS ---> │ disk.                  │ --> OUTPUT DATASETS
                    └────────────────────────┘

Files can be ignored by adding the comment: RESLIB_IGNORE: True

Uses reslib.automate.code_parser.CodeParser objects to extract comments from code, then calculates the dependency graph. This stemmed from doit graph, but I wanted more flexibility.

copyright
  (c) 2019 by Maclean Gaulin.

license

MIT, see LICENSE for more details.

class reslib.automate.scanner.DependencyScanner(*parsers, project_root='.', code_path_prefix=None, data_path_prefix=None, ignore_folders=None)[source]

Bases: object

Scan a code-base for dependencies.

project_root

Absolute path to the project root. os.path.abspath() is applied if the input is not already absolute.

Type

str, path

code_path_prefix

Prefix string to add to any code, used to resolve the absolute path via: `os.path.join(project_root, code_path_prefix, relative_code_path_from_comment)`.

Type

str, path, None

data_path_prefix

Prefix string to add to any data, used to resolve the absolute path via: `os.path.join(project_root, data_path_prefix, relative_data_path_from_comment)`.

Type

str, path, None

parser_list

List of CodeParser subclass objects (not instances!).

scanned_code

Result list of scanned CodeParser instances.

default_dot_attributes

Tuple of lines to be added to the .dot file output.

Private Attributes:

  _scanned_code: List of scanned CodeParser instances.

  _ignore_folders: Set of folders (and thus also sub-folders) to ignore.

Examples

Assume the following three files exist in the ~/projects/example folder:

code/data.sas:

```sas
/* INPUT_DATASET: funda.sas7bdat */
PROC EXPORT DATA=funda OUTFILE="data/stata_data.dta"; RUN;
/* OUTPUT_DATASET: stata_data.dta */
```

code/load_data.do:

```stata
/* INPUT_DATASET: stata_data.dta */
use "data/stata_data.dta"
```

code/analysis.do:

```stata
/* INPUT_FILE: load_data.do */
do "code/load_data.do"
```

Then the following would create a graph output at pipeline.pdf:

from reslib.automate import DependencyScanner
from reslib.automate.code_parser import SAS, Stata

# Just scan for SAS and Stata code, located in the code directory.
ds = DependencyScanner(SAS, Stata,
                       project_root='~/projects/example/',
                       code_path_prefix='code', data_path_prefix='data')
print(ds)
ds.DAG_to_file("pipeline.pdf")

Alternatively, a one-liner on the commandline:

python -c "from reslib.automate import *; DependencyScanner(code_path_prefix='code', data_path_prefix='data').DAG_to_file('pipeline')"

DAG(color_orphans=True, trim_dangling_data_nodes=True)[source]

Create the Directed Acyclic Graph (DAG) for the codebase.

Returns

DiGraph of the codebase, represented in networkX format.

Return type

networkx.DiGraph
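
The returned graph is a plain networkx object, so it can be inspected or post-processed directly (a sketch, assuming ds is a DependencyScanner that has already scanned the project):

```python
import networkx as nx

G = ds.DAG()
print(G.number_of_nodes(), G.number_of_edges())
print(list(nx.topological_sort(G)))   # one valid execution order for the pipeline
```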

DAG_to_file(filepath, G=None)[source]

Write a graphviz-style .dot file to be converted into an image (e.g. with the graphviz dot command).

Parameters

filepath (str,Path,File) – Path (or open file object) to write the .dot file to.

default_dot_attributes = {'edge': ['arrowsize=1.5'], 'graph': ['rankdir=LR'], 'node': ['style=filled']}

Default attributes to add to the .dot file output.

parser_list = None

List of Parsers to check the code against. Defaults to [SAS, Stata, Notebook, Python].

scan()[source]

Scan through the directory tree starting from self.project_root (or override_path if provided), calling analyze(file) for each file that matches a parser’s file pattern (by default *.<extension>).

The path that is passed to parser.analyze is always based on what was passed in: if project_root is absolute, the parser will get absolute paths; if it is relative, it will get relative paths.

Each CodeParser object contains four important values:

  • path_relative

  • input_files

  • input_datasets

  • output_datasets
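
A sketch of scanning explicitly and inspecting the results (project layout is hypothetical):

```python
from reslib.automate import DependencyScanner

ds = DependencyScanner(project_root=".", code_path_prefix="code", data_path_prefix="data")
ds.scan()
for parser in ds.scanned_code:
    print(parser.path_relative, parser.input_files,
          parser.input_datasets, parser.output_datasets)
```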

property scanned_code

Module contents

reslib.automate

This package facilitates using doit (pydoit.org) to automate data pipelines.

copyright
  (c) 2019 by Maclean Gaulin.

license

MIT, see LICENSE for more details.

reslib.automate.cleanpath(path_to_clean, re_pathsep=re.compile('[\\\\]+'), re_dotstart=re.compile('^./|/$'))[source]

Clean a path by replacing backslashes (\) with forward slashes (/), and removing a leading "./" and a trailing "/".

reslib.automate.pathjoin(*paths)[source]

Join, normalize, and clean a list of paths, allowing None values (which are filtered out).
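
A sketch of how these helpers behave, based on the descriptions above (the outputs in the comments are the expected results, not verified against the implementation):

```python
from reslib.automate import cleanpath, pathjoin

print(cleanpath(".\\code\\data.sas"))      # backslashes become '/', leading './' dropped -> "code/data.sas"
print(pathjoin("code", None, "data.sas"))  # None components are filtered out -> "code/data.sas"
```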