reslib.automate package¶
Subpackages¶
Submodules¶
reslib.automate.code_parser module¶
reslib.automate.code_parser¶
This module contains the basic functionality to parse dependencies from code. Currently it uses specially formatted comments to do so; the hope is that one day it will extract dependencies automatically.
Assumes files look something like this:
                   ┌────────────────────────┐
INPUT FILES    --> │ This file runs and     │ --> This file (file path)
                   │ creates some output    │
                   │ or writes data to      │
INPUT DATASETS --> │ disk.                  │ --> OUTPUT DATASETS
                   └────────────────────────┘
Parses comments that look like “# INPUT_FILE:”, “# INPUT_DATASET:”, or “# OUTPUT_DATASET:” and stores them. Files can be ignored by adding the comment “RESLIB_IGNORE: True”.
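The comment convention above can be illustrated with a minimal sketch. It assumes `#`-style line comments, and the names `MARKER_RE`, `IGNORE_RE`, and `parse_markers` are hypothetical, not part of reslib; the real parser builds its patterns from per-language comment delimiters.

```python
import re

# Hypothetical patterns for '#'-style comments; not reslib's actual regexes.
MARKER_RE = re.compile(
    r"#\s*(INPUT_FILE|INPUT_DATASET|OUTPUT_DATASET)\s*:\s*(\S+)"
)
IGNORE_RE = re.compile(r"RESLIB_IGNORE\s*:\s*True")


def parse_markers(source):
    """Collect marker comments from source text into sets keyed by marker name."""
    if IGNORE_RE.search(source):
        return None  # the file asked to be skipped
    found = {"INPUT_FILE": set(), "INPUT_DATASET": set(), "OUTPUT_DATASET": set()}
    for kind, path in MARKER_RE.findall(source):
        found[kind].add(path)
    return found


example = """\
# INPUT_FILE: clean.py
# INPUT_DATASET: raw.csv
# OUTPUT_DATASET: tidy.csv
print("work happens here")
"""
result = parse_markers(example)
```

Here `result` maps each marker name to the set of paths found for it, and a file containing the ignore comment yields `None`.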
- copyright
2019 by Maclean Gaulin.
- license
MIT, see LICENSE for more details.
-
class
reslib.automate.code_parser.
CodeParser
(path_relative=None, path_absolute=None, project_root='.', code_path_prefix=None, data_path_prefix=None, **kwargs)[source]¶ Bases:
object
CodeParser imagines a file as something that takes input, and makes output:
                   ┌────────────────────────┐
INPUT FILES    --> │ This file runs and     │ --> This file (file path)
                   │ creates some output    │
                   │ or writes data to      │
INPUT DATASETS --> │ disk.                  │ --> OUTPUT DATASETS
                   └────────────────────────┘
-
path_relative
¶ Path of the code to be analyzed, relative to os.path.join(project_root, code_path_prefix). Defaults to None, which means no file has been scanned.
-
code_path_prefix
¶ Relative path to the code directory, starting from project_root. Defaults to None, meaning all code paths are relative to project_root.
-
data_path_prefix
¶ Relative path to the data directory, starting from project_root. Defaults to None, meaning all data paths are relative to project_root.
-
project_root
¶ Root of the project. If your project has multiple roots, I can’t help you, friend.
- Type
str, path
- Private Attributes:
_language (str): Short name for the language of the CodeParser.
_extension (str): File extension of the code for this language.
_file_match_regex (re): Regular expression to match files to be checked by this parser. Default: *._extension.
_file_encoding (str): Encoding of the file to be opened (passed to open(path, encoding=self._file_encoding)).
_comment_start (str): The string to search for that marks the start of the comment.
_comment_start_regex (bool): Flag denoting that the _comment_start variable is a regular expression (pre-compiled or a string to be compiled).
_comment_end (str): The string to search for that marks the end of the comment.
_comment_end_regex (bool): Flag denoting that the _comment_end variable is a regular expression (pre-compiled or a string to be compiled).
_ignore_comment_text (str): String denoting the ignore comment.
_input_file_comment_text (str): String denoting the input file type.
_input_dataset_comment_text (str): String denoting the input dataset type.
_output_dataset_comment_text (str): String denoting the output dataset type.
_ignore_comment_regex (re): Regular expression compiled from the comment start/end attributes and the ignore comment text. Used to signal that a particular file should be ignored.
_input_file_comment_regex (re): Regular expression compiled from the comment start/end attributes and the comment text. Used to find input file comments in the code.
_input_dataset_comment_regex (re): Regular expression compiled from the comment start/end attributes and the comment text. Used to find input dataset comments in the code.
_output_dataset_comment_regex (re): Regular expression compiled from the comment start/end attributes and the comment text. Used to find output dataset comments in the code.
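The private attributes describe regular expressions compiled from a comment start, a comment end, and a marker text. One way this could work is sketched below; `build_comment_regex` is a hypothetical helper, and the exact pattern reslib compiles may differ.

```python
import re


def build_comment_regex(comment_start, comment_end, marker_text):
    """Compile a pattern matching '<start> MARKER: value <end>' (hypothetical helper)."""
    start = re.escape(comment_start)
    # An empty comment_end means the comment runs to the end of the line.
    end = re.escape(comment_end) if comment_end else r"$"
    return re.compile(
        rf"{start}\s*{re.escape(marker_text)}\s*:\s*(.+?)\s*{end}",
        re.MULTILINE,
    )


# Block comments as in SAS, line comments as in Python.
sas_regex = build_comment_regex("/*", "*/", "INPUT_DATASET")
py_regex = build_comment_regex("#", "", "OUTPUT_DATASET")

sas_hits = sas_regex.findall("/* INPUT_DATASET: funda.sas7bdat */\nPROC PRINT; RUN;")
py_hits = py_regex.findall("# OUTPUT_DATASET: tidy.csv\nprint('done')\n")
```

Escaping the delimiters with `re.escape` keeps characters such as `*` from being read as regex operators, which matters for block-comment languages.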
-
analyze
(path_relative=None, path_absolute=None)[source]¶ Analyze the actual file.
Calls self.analyze_code(file_contents) after reading the file.
- Parameters
path_relative (str) – Path to the file. Defaults to None, taking the path to analyze from the path_relative attribute.
- Returns
Dictionary of the resulting code object dependencies
- Return type
dict
- Raises
UnicodeDecodeError – Raised if file is not encoded according to self._file_encoding (default: utf-8)
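The read-then-analyze step and its failure mode can be sketched as follows. `read_code` is a hypothetical stand-in for the file-reading part of `analyze`; it only shows that opening a file with the wrong encoding raises `UnicodeDecodeError`.

```python
import os
import tempfile


def read_code(path, encoding="utf-8"):
    """Read a code file, letting UnicodeDecodeError propagate on a bad encoding."""
    with open(path, encoding=encoding) as handle:
        return handle.read()


with tempfile.TemporaryDirectory() as folder:
    good = os.path.join(folder, "good.py")
    bad = os.path.join(folder, "bad.py")
    with open(good, "w", encoding="utf-8") as handle:
        handle.write("# INPUT_FILE: a.py\n")
    with open(bad, "wb") as handle:
        handle.write(b"\xff\xfe not utf-8")  # 0xff is never valid in UTF-8

    contents = read_code(good)
    try:
        read_code(bad)
        decode_failed = False
    except UnicodeDecodeError:
        decode_failed = True
```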
-
check_parent_relationships
(potential_parent)[source]¶ Tests all outputs of provided potential parent to see if they match this object’s inputs.
- Parameters
potential_parent (CodeParser) – Potential parent code object to test the outputs against self’s inputs.
- Returns
List of overlapping files. These will be either the full_path of potential_parent or its output datasets.
- Return type
list
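A minimal sketch of the parent test, assuming inputs and outputs are stored as sets. `FakeParser` is a stand-in written for this illustration, not the real CodeParser, and it compares against `path_relative` rather than a full path.

```python
from dataclasses import dataclass, field


@dataclass
class FakeParser:
    """Stand-in for a parsed code file; not the real CodeParser."""
    path_relative: str
    input_files: set = field(default_factory=set)
    input_datasets: set = field(default_factory=set)
    output_datasets: set = field(default_factory=set)

    def check_parent_relationships(self, potential_parent):
        """Return the parent's path and/or output datasets that this file consumes."""
        overlaps = []
        if potential_parent.path_relative in self.input_files:
            overlaps.append(potential_parent.path_relative)
        overlaps.extend(sorted(potential_parent.output_datasets & self.input_datasets))
        return overlaps


parent = FakeParser("a.sas", output_datasets={"funda.sas7bdat"})
child = FakeParser("b.sas", input_files={"a.sas"}, input_datasets={"funda.sas7bdat"})

overlap = child.check_parent_relationships(parent)
unrelated = parent.check_parent_relationships(child)
```

An empty list means the candidate is not a parent; a non-empty list names the files or datasets that create the dependency.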
-
code_path_prefix
= None¶ Relative path to the code directory, starting from project_root.
-
data_path_prefix
= None¶ Relative path to the data directory, starting from project_root.
-
input_datasets
= None¶ Set of input datasets scanned from comments in the code.
-
input_files
= None¶ Set of input files scanned from comments in the code.
-
property
is_parsed
¶
-
matches_input
(file_to_check)[source]¶ Asks “Is this one of your inputs?”, for testing if this is your ‘child’.
Checks the given file against this code’s input files (for a file dependency) and input datasets (for a dataset dependency).
- Parameters
file_to_check (str, path) – File to check against this one’s ‘inputs’
- Returns
True if file_to_check is in the input files or datasets.
- Return type
bool
Example
    parent = CodeParser(path_relative="a.sas")
    child = CodeParser(path_relative="b.sas")
    for f in parent.output_datasets:
        match = child.matches_input(f)
        if not match:
            print(f"No relation for {f}")
        else:
            print(f"{child.path_relative} is the {match} we are looking for")
-
matches_output
(file_to_check)[source]¶ “Are you my mother?” test. Returns a truthy value if the file matches one of the ‘outputs’ of this code. Should be tested against the inputs of other code to find a parent dependency.
Checks the input file against this code’s path_relative (for a file dependency) and dataset outputs (for a dataset dependency).
- Parameters
file_to_check (str, path) – File to check against this one’s ‘outputs’, the relative path.
- Returns
The path to the file or dataset that matches, otherwise None.
- Return type
Example
    parent = CodeParser(path_relative="a.sas")
    child = CodeParser(path_relative="b.sas")
    for f in child.input_files | child.input_datasets:
        match = parent.matches_output(f)
        if not match:
            print(f"No relation for {f}")
        else:
            print(f"{parent.path_relative} is the {match} we are looking for")
-
output_datasets
= None¶ Set of output datasets scanned from comments in the code.
-
path_absolute
= None¶ Absolute path of the code to be analyzed. Defaults to None, which means no file has been scanned.
-
path_relative
= None¶ Relative path of the code to be analyzed, relative to os.path.join(project_root, code_path_prefix). Defaults to None, which means no file has been scanned.
-
project_root
= None¶ Root of the project. If your project has multiple roots, I can’t help you, friend.
-
set_path
(path_relative=None, path_absolute=None, project_root='.', code_path_prefix=None, data_path_prefix=None)[source]¶ Set the file path of the analyzed object, and calculate its relative position to base_dir.
- Parameters
path_relative (str, path, None) – Path of the code to be analyzed, relative to os.path.join(project_root, code_path_prefix). Defaults to None.
path_absolute (str, path, None) – Absolute path of the code to be analyzed, from which path_relative is calculated. Ignored if path_relative is provided. Defaults to None.
code_path_prefix (str, path, None) – Relative path to the code directory, starting from project_root. Defaults to None, meaning all code paths are relative to project_root.
data_path_prefix (str, path, None) – Relative path to the data directory, starting from project_root. Defaults to None, meaning all data paths are relative to project_root.
project_root (str, path) – Root of the project. If your project has multiple roots, I can’t help you, friend.
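The relationship between the relative and absolute paths can be sketched with `os.path`. `resolve_paths` is a hypothetical helper that follows the parameter descriptions above (`path_relative` takes precedence over `path_absolute`); it is not the library's `set_path`.

```python
import os


def resolve_paths(path_relative=None, path_absolute=None,
                  project_root=".", code_path_prefix=None):
    """Return (path_relative, path_absolute) for a code file (hypothetical helper)."""
    code_dir = os.path.abspath(os.path.join(project_root, code_path_prefix or ""))
    if path_relative is not None:
        # path_relative wins; path_absolute is ignored, as documented.
        return path_relative, os.path.join(code_dir, path_relative)
    if path_absolute is not None:
        return os.path.relpath(path_absolute, start=code_dir), path_absolute
    return None, None


root = os.path.join(os.sep, "projects", "example")
rel, abs_ = resolve_paths(path_relative="data.sas", project_root=root,
                          code_path_prefix="code")
rel2, _ = resolve_paths(path_absolute=os.path.join(root, "code", "sub", "x.sas"),
                        project_root=root, code_path_prefix="code")
```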
-
-
class
reslib.automate.code_parser.
CodeParserMetaclass
[source]¶ Bases:
type
Compile a CodeParser class to include the regex and parsing functionality.
This allows for SAS.matches instead of SAS().matches.
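The idea of compiling patterns when the class is created, so they are usable without an instance, can be sketched with a small metaclass. `ParserMeta`, `BaseParser`, and the `matches` method here are hypothetical and only mirror the behaviour described; the real CodeParserMetaclass may be organized differently.

```python
import re


class ParserMeta(type):
    """Hypothetical metaclass: compiles a file-matching regex when a class is created."""

    def __init__(cls, name, bases, namespace):
        super().__init__(name, bases, namespace)
        extension = namespace.get("_extension")
        if extension:
            cls._file_match_regex = re.compile(rf".*\.{re.escape(extension)}$")

    def matches(cls, filename):
        # Defined on the metaclass, so it is callable on the class itself.
        regex = getattr(cls, "_file_match_regex", None)
        return bool(regex and regex.match(filename))


class BaseParser(metaclass=ParserMeta):
    _extension = None


class SAS(BaseParser):
    _extension = "sas"


sas_match = SAS.matches("code/data.sas")
py_match = SAS.matches("code/data.py")
```

Because the regex is attached at class-creation time, a scanner can hold a list of parser classes and ask each one whether it handles a file before instantiating anything.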
-
class
reslib.automate.code_parser.
Manual
(path_relative=None, path_absolute=None, project_root='.', code_path_prefix=None, data_path_prefix=None, **kwargs)[source]¶ Bases:
reslib.automate.code_parser.CodeParser
Manual downloader is used to give instructions for a manual step that isn’t automated in code.
-
class
reslib.automate.code_parser.
Notebook
(path_relative=None, path_absolute=None, project_root='.', code_path_prefix=None, data_path_prefix=None, **kwargs)[source]¶
-
class
reslib.automate.code_parser.
Python
(path_relative=None, path_absolute=None, project_root='.', code_path_prefix=None, data_path_prefix=None, **kwargs)[source]¶
-
class
reslib.automate.code_parser.
SAS
(path_relative=None, path_absolute=None, project_root='.', code_path_prefix=None, data_path_prefix=None, **kwargs)[source]¶
reslib.automate.scanner module¶
reslib.automate.scanner¶
This module contains the basic functionality to scan a code-base and extract dependencies from comments.
Assumes files look something like this:
                   ┌────────────────────────┐
INPUT FILES    --> │ This file runs and     │ --> This file (file path)
                   │ creates some output    │
                   │ or writes data to      │
INPUT DATASETS --> │ disk.                  │ --> OUTPUT DATASETS
                   └────────────────────────┘
Files can be ignored by adding the comment: RESLIB_IGNORE: True
Uses reslib.automate.code_parser.CodeParser objects to extract comments from code, then calculates the dependency graph.
This stemmed from doit graph, but I wanted more flexibility.
- copyright
2019 by Maclean Gaulin.
- license
MIT, see LICENSE for more details.
-
class
reslib.automate.scanner.
DependencyScanner
(*parsers, project_root='.', code_path_prefix=None, data_path_prefix=None, ignore_folders=None)[source]¶ Bases:
object
Scan a code-base for dependencies.
-
project_root
¶ Absolute path to project root. Will call os.path.abspath() if the input is not already absolute.
- Type
str, path
-
code_path_prefix
¶ Prefix string to add to any code, used to resolve the absolute path via: `os.path.join(project_root, code_path_prefix, relative_code_path_from_comment)`.
-
data_path_prefix
¶ Prefix string to add to any data, used to resolve the absolute path via: `os.path.join(project_root, data_path_prefix, relative_data_path_from_comment)`.
-
parser_list
¶ List of CodeParser subclass objects (not instances!).
-
scanned_code
¶ Result list of scanned CodeParser instances.
-
default_dot_attributes
¶ Tuple of lines to be added to the .dot file output.
- Private Attributes:
_scanned_code: List of scanned CodeParser instances.
_ignore_folders: Set of folders (thus also sub-folders) to ignore.
Examples
Assume the following three files exist in the ~/projects/example folder:
code/data.sas
    /* INPUT_DATASET funda.sas7bdat */
    PROC EXPORT DATA=funda OUTFILE="data/stata_data.dta";
    RUN;
    /* OUTPUT: stata_data.dta */
Then the following would create a graph output at pipeline.pdf:
    from reslib.automate import DependencyScanner
    # Just scan for SAS and Stata code, located in the code directory.
    ds = DependencyScanner(project_root='~/projects/example/',
                           code_path_prefix='code', data_path_prefix='data')
    print(ds)
    ds.DAG_to_file("pipeline.pdf")
Alternatively, a one-liner on the commandline:
    python -c "from reslib.automate import *; DependencyScanner(code_path_prefix='code', data_path_prefix='data').DAG_to_file('pipeline')"
-
DAG
(color_orphans=True, trim_dangling_data_nodes=True)[source]¶ Create the Directed Acyclic Graph (DAG) for the codebase.
- Returns
DiGraph of the codebase, represented in networkX format.
- Return type
networkx.DiGraph
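How scanned inputs and outputs turn into a directed graph can be sketched without networkx. The example below uses a plain edge list and the standard library's `graphlib` in place of `networkx.DiGraph`; the `scanned` data and `build_edges` helper are hypothetical.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical scan results: code file -> (input datasets, output datasets).
scanned = {
    "download.py": (set(), {"raw.csv"}),
    "clean.sas": ({"raw.csv"}, {"tidy.sas7bdat"}),
    "analyze.py": ({"tidy.sas7bdat"}, set()),
}


def build_edges(scanned):
    """Return a sorted list of (source, target) edges linking code files and datasets."""
    edges = set()
    for code, (inputs, outputs) in scanned.items():
        edges.update((dataset, code) for dataset in inputs)
        edges.update((code, dataset) for dataset in outputs)
    return sorted(edges)


edges = build_edges(scanned)

# Map each node to its predecessors, the shape TopologicalSorter expects.
predecessors = {}
for source, target in edges:
    predecessors.setdefault(target, set()).add(source)
order = list(TopologicalSorter(predecessors).static_order())
```

Datasets sit between the code files that write and read them, so a topological order of the graph is also a valid run order for the pipeline.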
-
DAG_to_file
(filepath, G=None)[source]¶ Write a graphviz-style *.dot file to be converted into
- Parameters
filepath (str,Path,File) – Path (or open file object) to write the .dot file to.
-
default_dot_attributes
= {'edge': ['arrowsize=1.5'], 'graph': ['rankdir=LR'], 'node': ['style=filled']}¶ Default attributes to add to the .dot file output.
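A rough sketch of how such an attribute dictionary could be rendered into Graphviz DOT text. `write_dot` is a hypothetical helper and the graph name `pipeline` is an assumption; the output produced by `DAG_to_file` may be laid out differently.

```python
import io

default_dot_attributes = {
    "edge": ["arrowsize=1.5"],
    "graph": ["rankdir=LR"],
    "node": ["style=filled"],
}


def write_dot(edges, handle, attributes=default_dot_attributes):
    """Write edges as a Graphviz digraph to an open text handle (hypothetical helper)."""
    handle.write("digraph pipeline {\n")
    for element, settings in sorted(attributes.items()):
        handle.write(f"    {element} [{', '.join(settings)}];\n")
    for source, target in edges:
        handle.write(f'    "{source}" -> "{target}";\n')
    handle.write("}\n")


buffer = io.StringIO()
write_dot([("raw.csv", "clean.sas"), ("clean.sas", "tidy.sas7bdat")], buffer)
dot_text = buffer.getvalue()
```

Text in this form can be rendered with the Graphviz command-line tool, for example `dot -Tpdf pipeline.dot -o pipeline.pdf`.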
-
parser_list
= None¶ List of Parsers to check against code. Defaults to [SAS, Stata, Notebook, Python]
-
scan
()[source]¶ Scan through the directory starting from self.project_root (or override_path if provided), calling analyze(file) for each file that matches *.extension.
The dir that is passed to parser.analyze is always based on what was passed in. If project_root is absolute, the parser will get absolute paths. If it is relative, it will get relative paths.
Each CodeParser object contains four important values:
path_relative
input_files
input_datasets
output_datasets
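The directory walk described above can be sketched with `os.walk`. `find_code_files` is a hypothetical helper, and the folder names are made up for the example; it shows extension matching and folder pruning, not the library's actual `scan`.

```python
import os
import tempfile


def find_code_files(project_root, extensions, ignore_folders=()):
    """Yield paths under project_root whose extension is listed, skipping ignored folders."""
    for dirpath, dirnames, filenames in os.walk(project_root):
        # Pruning dirnames in place stops os.walk from descending into them.
        dirnames[:] = sorted(d for d in dirnames if d not in ignore_folders)
        for filename in sorted(filenames):
            if os.path.splitext(filename)[1].lstrip(".") in extensions:
                yield os.path.join(dirpath, filename)


with tempfile.TemporaryDirectory() as root:
    for relative in ("code/data.sas", "code/notes.txt", "archive/old.sas"):
        target = os.path.join(root, *relative.split("/"))
        os.makedirs(os.path.dirname(target), exist_ok=True)
        open(target, "w").close()
    found = [os.path.relpath(p, root)
             for p in find_code_files(root, {"sas", "py"}, ignore_folders={"archive"})]
```

Since each yielded path is built by joining onto `project_root` as given, an absolute root yields absolute paths and a relative root yields relative ones, which matches the behaviour described for `scan`.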
-
property
scanned_code
¶
-
Module contents¶
reslib.automate¶
This package facilitates using doit (pydoit.org) to automate data pipelines.
- copyright
2019 by Maclean Gaulin.
- license
MIT, see LICENSE for more details.