reslib.automate package

Subpackages

Submodules

reslib.automate.code_parser module

reslib.automate.code_parser

This module contains the basic functionality to parse dependencies from code. Currently it uses specially formatted comments to do so, hopefully one day it will automatically extract dependencies.

Assumes files look something like this:

                    ┌────────────────────────┐
INPUT FILES ------> │ This file runs and     │ --> This file (file path)
                    │ creates some output    │
                    │ or writes data to      │
INPUT DATASETS ---> │ disk.                  │ --> OUTPUT DATASETS
                    └────────────────────────┘

Parses comments that look like "# INPUT_FILE:", "# INPUT_DATASET:", or "# OUTPUT_DATASET:" and stores what they reference. Files can be ignored by adding the comment: "RESLIB_IGNORE: True".
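
For illustration, a Python script tagged for this parser might look like the following sketch (file and dataset names are hypothetical):

```python
# code/build_panel.py -- hypothetical example of the comment tags
# INPUT_FILE: code/utilities.py
# INPUT_DATASET: raw/funda.csv
# OUTPUT_DATASET: interim/panel.csv
import pandas as pd

df = pd.read_csv("raw/funda.csv")   # corresponds to the INPUT_DATASET comment above
df.to_csv("interim/panel.csv")      # corresponds to the OUTPUT_DATASET comment above
```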

copyright
  (c) 2019 by Maclean Gaulin.

license

MIT, see LICENSE for more details.

class reslib.automate.code_parser.CodeParser(path_relative=None, path_absolute=None, project_root='.', code_path_prefix=None, data_path_prefix=None, **kwargs)[source]

Bases: object

CodeParser imagines a file as something that takes input, and makes output:

                    ┌────────────────────────┐
INPUT FILES ------> │ This file runs and     │ --> This file (file path)
                    │ creates some output    │
                    │ or writes data to      │
INPUT DATASETS ---> │ disk.                  │ --> OUTPUT DATASETS
                    └────────────────────────┘

path_relative

Path of the code to be analyzed, relative to os.path.join(project_root, code_path_prefix). Defaults to None, which means no file has been scanned.

Type

str, path, None

code_path_prefix

Relative path to the code directory, starting from project_root. Defaults to None, meaning all code paths are relative to project_root.

Type

str, path, None

data_path_prefix

Relative path to the data directory, starting from project_root. Defaults to None, meaning all data paths are relative to project_root.

Type

str, path, None

project_root

Root of the project. If your project has multiple roots, I can’t help you friend.

Type

str, path

input_files

Set of input files scanned from comments in the code.

Type

set

input_datasets

Set of input datasets scanned from comments in the code.

Type

set

output_datasets

Set of output datasets scanned from comments in the code.

Type

set

Private Attributes:

  _language (str): Short name for the language of the CodeParser.

  _extension (str): File extension of the code for this language.

  _file_match_regex (re): Regular expression to match files to be checked by this parser. Default: *._extension.

  _file_encoding (str): Encoding of the file to be opened (passed to open(path, encoding=self._file_encoding)).

  _comment_start (str): The string to search for marking the start of the comment.

  _comment_start_regex (bool): Flag denoting that the _comment_start variable is a regular expression (pre-compiled or string to be compiled).

  _comment_end (str): The string to search for marking the end of the comment.

  _comment_end_regex (bool): Flag denoting that the _comment_end variable is a regular expression (pre-compiled or string to be compiled).

  _ignore_comment_text (str): String denoting the ignore comment.

  _input_file_comment_text (str): String denoting the input file type.

  _input_dataset_comment_text (str): String denoting the input dataset type.

  _output_dataset_comment_text (str): String denoting the output dataset type.

  _ignore_comment_regex (re): Regular expression compiled from the comment start/end attributes and the ignore comment text. Used to signal that a particular file should be ignored.

  _input_file_comment_regex (re): Regular expression compiled from the comment start/end attributes and the comment text. Used to find input file comments in the code.

  _input_dataset_comment_regex (re): Regular expression compiled from the comment start/end attributes and the comment text. Used to find input dataset comments in the code.

  _output_dataset_comment_regex (re): Regular expression compiled from the comment start/end attributes and the comment text. Used to find output dataset comments in the code.
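
As a rough sketch, a new language could in principle be supported by subclassing CodeParser and filling in these attributes; the class and attribute values below are illustrative guesses, not part of reslib:

```python
from reslib.automate.code_parser import CodeParser

class R(CodeParser):
    """Hypothetical parser for R scripts (attribute values are guesses)."""
    _language = "r"
    _extension = ".R"
    _comment_start = "#"    # R line comments start with '#'
    _comment_end = "\n"     # and run to the end of the line
```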

analyze(path_relative=None, path_absolute=None)[source]

Analyze the actual file.

Calls self.analyze_code(file_contents) after reading the file.

Parameters

  • path_relative (str, path, None) – Path to the file to analyze. Defaults to None, in which case the path is taken from self.path_relative.

  • path_absolute (str, path, None) – Absolute path to the file to analyze. Ignored if path_relative is provided. Defaults to None.

Returns

Dictionary of the resulting code object dependencies

Return type

dict

Raises

UnicodeDecodeError – Raised if the file is not encoded according to self._file_encoding (default: utf-8)
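
A minimal sketch of scanning a single file from disk (the project layout and file names are hypothetical):

```python
from reslib.automate.code_parser import SAS

parser = SAS(path_relative="data.sas", project_root="/home/me/projects/example",
             code_path_prefix="code", data_path_prefix="data")
deps = parser.analyze()        # reads the file and parses its dependency comments
print(parser.input_datasets)   # e.g. {"funda.sas7bdat"} if that tag is present
```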

analyze_code(code)[source]

Analyze the text of the file.

Parameters

code (str) – Text of the file to be analyzed.

Returns

Returns True if the code had tags to parse.

Return type

bool
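
analyze_code can also be called directly on a string, which is a quick way to check that a comment tag is recognized (a sketch; the snippet is made up):

```python
from reslib.automate.code_parser import Python

parser = Python()
has_tags = parser.analyze_code(
    "# INPUT_DATASET: raw/funda.csv\n"
    "# OUTPUT_DATASET: interim/panel.csv\n"
)
print(has_tags)                 # True when at least one tag was parsed
print(parser.output_datasets)   # expected to contain 'interim/panel.csv'
```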

check_parent_relationships(potential_parent)[source]

Tests all outputs of the provided potential parent to see whether they match this object’s inputs.

Parameters

potential_parent (CodeParser) – Potential parent code object to test the outputs against self’s inputs.

Returns

List of overlapping files. These will be either the full_path of potential_parent or its output datasets.

Return type

list
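
For example, after two files have been analyzed, a potential parent/child pair can be tested like this (a sketch with hypothetical file names):

```python
from reslib.automate.code_parser import SAS, Stata

parent = SAS(path_relative="data.sas", project_root=".", code_path_prefix="code")
child = Stata(path_relative="load_data.do", project_root=".", code_path_prefix="code")
parent.analyze()
child.analyze()

overlap = child.check_parent_relationships(parent)
if overlap:
    print(f"{parent.path_relative} produces {overlap}, which {child.path_relative} consumes")
```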

code_path_prefix = None

Relative path to the code directory, starting from project_root.

data_path_prefix = None

Relative path to the data directory, starting from project_root.

input_datasets = None

Set of input datasets scanned from comments in the code.

input_files = None

Set of input files scanned from comments in the code.

property is_parsed
classmethod matches(path_relative)[source]
matches_input(file_to_check)[source]

Asks “Is this one of your inputs”, for testing if this is your ‘child’.

Checks the given file against this code’s input files (for a file dependency) and input datasets (for a dataset dependency).

Parameters

file_to_check (str, path) – File to check against this one’s ‘inputs’

Returns

Returns True if file_to_check is in the input files or datasets.

Return type

(bool)

Example

parent = CodeParser(path_relative="a.sas")
child = CodeParser(path_relative="b.sas")

for f in parent.output_datasets:
    match = child.matches_input(f)
    if not match:
        print(f"No relation for {f}")
    else:
        print(f"{child.path_relative} is the {match} we are looking for")

matches_output(file_to_check)[source]

“Are you my mother” test. Returns True-like if the file matches one of the ‘outputs’ of this code. Should be tested against inputs of other code to find a parent dependency.

Checks the input file against this code’s path_relative (for a file dependency) and dataset outputs (for a dataset dependency).

Parameters

file_to_check (str, path) – File (relative path) to check against this one’s ‘outputs’.

Returns

The path to the file or dataset that matches, otherwise None.

Return type

(str, None)

Example

parent = CodeParser(path_relative="a.sas")
child = CodeParser(path_relative="b.sas")

for f in child.input_files | child.input_datasets:
    match = parent.matches_output(f)
    if not match:
        print(f"No relation for {f}")
    else:
        print(f"{parent.path_relative} is the {match} we are looking for")

output_datasets = None

Set of output datasets scanned from comments in the code.

path_absolute = None

Absolute path of the code to be analyzed. Defaults to None, which means no file has been scanned.

path_relative = None

Relative path of the code to be analyzed, relative to os.path.join(project_root, code_path_prefix). Defaults to None, which means no file has been scanned.

project_root = None

Root of the project. If your project has multiple roots, I can’t help you friend.

set_path(path_relative=None, path_absolute=None, project_root='.', code_path_prefix=None, data_path_prefix=None)[source]

Set the file path of the analyzed object, and calculate its position relative to the project root.

Parameters
  • path_relative (str, path, None) – Path of the code to be analyzed, relative to os.path.join(project_root, code_path_prefix). Defaults to None.

  • path_absolute (str, path, None) – Absolute path of the code to be analyzed, from which path_relative is calculated. Ignored if path_relative is provided. Defaults to None.

  • code_path_prefix (str, path, None) – Relative path to the code directory, starting from project_root. Defaults to None, meaning all code paths are relative to project_root.

  • data_path_prefix (str, path, None) – Relative path to the data directory, starting from project_root. Defaults to None, meaning all data paths are relative to project_root.

  • project_root (str, path) – Root of the project. If your project has multiple roots, I can’t help you friend.
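
A sketch of setting the path after construction (paths are hypothetical):

```python
from reslib.automate.code_parser import SAS

parser = SAS()
parser.set_path(path_absolute="/home/me/projects/example/code/data.sas",
                project_root="/home/me/projects/example",
                code_path_prefix="code", data_path_prefix="data")
print(parser.path_relative)   # expected: "data.sas", relative to project_root/code
```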

class reslib.automate.code_parser.CodeParserMetaclass[source]

Bases: type

Compile a CodeParser class to include the regex and parsing functionality.

This allows for SAS.matches instead of SAS().matches
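
In practice this means file matching can be done on the class itself, for example (a sketch):

```python
from reslib.automate.code_parser import SAS, Stata

print(SAS.matches("code/clean_data.sas"))    # True-like: the SAS parser handles .sas files
print(Stata.matches("code/clean_data.sas"))  # False-like: the extension does not match
```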

class reslib.automate.code_parser.Manual(path_relative=None, path_absolute=None, project_root='.', code_path_prefix=None, data_path_prefix=None, **kwargs)[source]

Bases: reslib.automate.code_parser.CodeParser

Manual downloader is used to give instructions for a manual step that isn’t automated in code.

class reslib.automate.code_parser.Notebook(path_relative=None, path_absolute=None, project_root='.', code_path_prefix=None, data_path_prefix=None, **kwargs)[source]

Bases: reslib.automate.code_parser.CodeParser

class reslib.automate.code_parser.Python(path_relative=None, path_absolute=None, project_root='.', code_path_prefix=None, data_path_prefix=None, **kwargs)[source]

Bases: reslib.automate.code_parser.CodeParser

class reslib.automate.code_parser.SAS(path_relative=None, path_absolute=None, project_root='.', code_path_prefix=None, data_path_prefix=None, **kwargs)[source]

Bases: reslib.automate.code_parser.CodeParser

class reslib.automate.code_parser.Stata(path_relative=None, path_absolute=None, project_root='.', code_path_prefix=None, data_path_prefix=None, **kwargs)[source]

Bases: reslib.automate.code_parser.CodeParser

reslib.automate.scanner module

reslib.automate.scanner

This module contains the basic functionality to scan a code-base and extract dependencies from comments.

Assumes files look something like this:

                    ┌────────────────────────┐
INPUT FILES ------> │ This file runs and     │ --> This file (file path)
                    │ creates some output    │
                    │ or writes data to      │
INPUT DATASETS ---> │ disk.                  │ --> OUTPUT DATASETS
                    └────────────────────────┘

Files can be ignored by adding the comment: RESLIB_IGNORE: True

Uses reslib.automate.code_parser.CodeParser objects to extract comments from code, then calculates the dependency graph. This stemmed from doit graph, but I wanted more flexibility.

copyright
  (c) 2019 by Maclean Gaulin.

license

MIT, see LICENSE for more details.

class reslib.automate.scanner.DependencyScanner(*parsers, project_root='.', code_path_prefix=None, data_path_prefix=None, ignore_folders=None)[source]

Bases: object

Scan a code-base for dependencies.

project_root

Absolute path to the project root. os.path.abspath() is applied if the input is not already absolute.

Type

str, path

code_path_prefix

Prefix string to add to any code, used to resolve the absolute path via: `os.path.join(project_root, code_path_prefix, relative_code_path_from_comment)`.

Type

str, path, None

data_path_prefix

Prefix string to add to any data, used to resolve the absolute path via: `os.path.join(project_root, data_path_prefix, relative_data_path_from_comment)`.

Type

str, path, None

parser_list

List of CodeParser subclass objects (not instances!).

scanned_code

Result list of scanned CodeParser instances.

default_dot_attributes

Tuple of lines to be added to the .dot file output.

Private Attributes:

  _scanned_code: List of scanned CodeParser instances.

  _ignore_folders: Set of folders (and thus also sub-folders) to ignore.

Examples

Assume the following three files exist in the ~/projects/example folder:

code/data.sas:

```sas
/* INPUT_DATASET: funda.sas7bdat */
PROC EXPORT DATA=funda OUTFILE="data/stata_data.dta"; RUN;
/* OUTPUT_DATASET: stata_data.dta */
```

code/load_data.do:

```stata
/* INPUT_DATASET: stata_data.dta */
use "data/stata_data.dta"
```

code/analysis.do:

```stata
/* INPUT_FILE: load_data.do */
do "code/load_data.do"
```

Then the following would create a graph output at pipeline.pdf:

from reslib.automate import DependencyScanner
from reslib.automate.code_parser import SAS, Stata

# Just scan for SAS and Stata code, located in the code directory.
ds = DependencyScanner(SAS, Stata,
                       project_root='~/projects/example/',
                       code_path_prefix='code', data_path_prefix='data')
print(ds)
ds.DAG_to_file("pipeline.pdf")

Alternatively, a one-liner on the commandline:

python -c "from reslib.automate import *; DependencyScanner(code_path_prefix='code', data_path_prefix='data').DAG_to_file('pipeline')"

DAG(color_orphans=True, trim_dangling_data_nodes=True)[source]

Create the Directed Acyclic Graph (DAG) for the codebase.

Returns

DiGraph of the codebase, represented in networkX format.

Return type

networkx.DiGraph
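
The returned graph is a plain networkx object, so it can be inspected or post-processed directly (a sketch, assuming ds is a DependencyScanner that has already scanned the project):

```python
import networkx as nx

G = ds.DAG()
print(G.number_of_nodes(), G.number_of_edges())
print(list(nx.topological_sort(G)))   # one valid execution order for the pipeline
```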

DAG_to_file(filepath, G=None)[source]

Write a graphviz-style .dot file to be converted into an image (e.g. with the graphviz dot command).

Parameters

filepath (str,Path,File) – Path (or open file object) to write the .dot file to.

default_dot_attributes = {'edge': ['arrowsize=1.5'], 'graph': ['rankdir=LR'], 'node': ['style=filled']}

Default attributes to add to the .dot file output.

parser_list = None

List of Parsers to check the code against. Defaults to [SAS, Stata, Notebook, Python].

scan()[source]

Scan through the directory tree starting from self.project_root (or override_path if provided), calling analyze(file) for each file that matches a parser’s file pattern (by default *.<extension>).

The path that is passed to parser.analyze is always based on what was passed in: if project_root is absolute, the parser will get absolute paths; if it is relative, it will get relative paths.

Each CodeParser object contains four important values:

  • path_relative

  • input_files

  • input_datasets

  • output_datasets
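
A sketch of scanning explicitly and inspecting the results (project layout is hypothetical):

```python
from reslib.automate import DependencyScanner

ds = DependencyScanner(project_root=".", code_path_prefix="code", data_path_prefix="data")
ds.scan()
for parser in ds.scanned_code:
    print(parser.path_relative, parser.input_files,
          parser.input_datasets, parser.output_datasets)
```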

property scanned_code

Module contents

reslib.automate

This package facilitates using doit (pydoit.org) to automate data pipelines.

copyright
  (c) 2019 by Maclean Gaulin.

license

MIT, see LICENSE for more details.

reslib.automate.cleanpath(path_to_clean, re_pathsep=re.compile('[\\\\]+'), re_dotstart=re.compile('^./|/$'))[source]

Clean a path by replacing backslashes (\) with forward slashes (/), and removing a leading "./" and a trailing "/".

reslib.automate.pathjoin(*paths)[source]

Join, normalize, and clean a list of paths, allowing None values (which are filtered out).
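
A sketch of how these helpers behave, based on the descriptions above (the outputs in the comments are the expected results, not verified against the implementation):

```python
from reslib.automate import cleanpath, pathjoin

print(cleanpath(".\\code\\data.sas"))      # backslashes become '/', leading './' dropped -> "code/data.sas"
print(pathjoin("code", None, "data.sas"))  # None components are filtered out -> "code/data.sas"
```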