reslib.data package

Submodules

reslib.data.cache module

reslib.data.cache

This module contains the DataFrameCache object for reading/writing cached datasets to disk.

copyright
  (c) 2019 by Maclean Gaulin.

license

MIT, see LICENSE for more details.

class reslib.data.cache.DataFrameCache(override_filename=None, delete_cache=False)

Bases: object

Base class for caching intermediate files.

Defaults to reading/writing the dataset cache with pandas to_csv. Default write args: sep='\t', index=False. Default read args: sep='\t'.

Suggested subclassing:

from reslib.data.cache import DataFrameCache

class CompustatFUNDA(DataFrameCache):
    override_directory = '~/project/data/comp/'
    filename = 'funda'

    def make_dataset(self):
        # Download funda and return it as a DataFrame
        pass

property data

Property accessor for the underlying dataframe. Loads cached dataframe into memory, calling make_dataset() if no cache is available.
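
The lazy "load from cache, else build and cache" behaviour described above can be sketched roughly as follows. This is an illustration of the pattern only, not reslib's implementation: the class name LazyCache, its private helpers, and the sample rows are hypothetical, and the standard-library csv module stands in for pandas so the sketch runs without dependencies.

```python
import csv
import tempfile
from pathlib import Path


class LazyCache:
    """Hypothetical stand-in for the cache-or-build pattern (not reslib code)."""

    def __init__(self, path):
        self.path = Path(path)
        self._rows = None  # in-memory copy, filled on first access

    @property
    def is_cached(self):
        # True once the cache file exists on disk.
        return self.path.exists()

    def make_dataset(self):
        # Placeholder for an expensive download or computation.
        return [{"gvkey": "001", "at": "10.5"}, {"gvkey": "002", "at": "3.2"}]

    @property
    def data(self):
        # Build and write the cache only if it is missing, then read it back.
        if self._rows is None:
            if not self.is_cached:
                self._write(self.make_dataset())
            self._rows = self._read()
        return self._rows

    def _write(self, rows):
        with self.path.open("w", newline="") as fh:
            writer = csv.DictWriter(fh, fieldnames=list(rows[0]), delimiter="\t")
            writer.writeheader()
            writer.writerows(rows)

    def _read(self):
        with self.path.open(newline="") as fh:
            return list(csv.DictReader(fh, delimiter="\t"))


with tempfile.TemporaryDirectory() as tmp:
    cache = LazyCache(Path(tmp) / "funda.tsv")
    was_cached_before = cache.is_cached
    rows = cache.data  # first access builds the dataset and writes the file
    is_cached_after = cache.is_cached
```

Because the values are read back from the tab-separated file, later accesses within the same object reuse the in-memory copy, and a new object pointed at the same path would skip make_dataset() entirely.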

delete_cache()

Method for deleting cached file if it exists.

df = None

DataFrame of the data

filename = None

Override filename to name the dataset.

property is_cached

Boolean value for whether cached file exists at path.

make_dataset()

Make dataset to be saved to cache.

Should return a dataframe.

override_directory = None

Override directory to store the dataset in.

path = None

Full path to the dataset.

read(read_args=None)

Read df from cache, returning ‘cleaned’ df.

Calls: _pre_read_hook() before, and _post_read_hook(read_df) after.

Parameters

read_args (dict) – Dictionary of read-args to be passed to the read function, overriding those specified in self.read_args.

Returns

DataFrame which is passed through _post_read_hook(df).

Return type

pandas.DataFrame
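
The call order and argument override described for read() might look like the sketch below. It assumes that per-call read_args are merged over the class-level defaults with the per-call values winning; the class ReaderSketch and its _load helper are hypothetical and only record what they receive, so this shows the flow and makes no claim about reslib's actual code.

```python
class ReaderSketch:
    """Hypothetical illustration of the read() call order (not reslib code)."""

    read_args = {"sep": "\t"}  # class-level defaults

    def __init__(self):
        self.calls = []  # records the order of operations for inspection

    def _pre_read_hook(self):
        self.calls.append("pre")

    def _load(self, **kwargs):
        # Stand-in for the actual file read; returns the arguments it received.
        self.calls.append("load")
        return kwargs

    def _post_read_hook(self, df):
        self.calls.append("post")
        return df

    def read(self, read_args=None):
        self._pre_read_hook()
        # Per-call arguments take precedence over the class defaults.
        merged = {**self.read_args, **(read_args or {})}
        return self._post_read_hook(self._load(**merged))


reader = ReaderSketch()
result = reader.read(read_args={"sep": ",", "nrows": 5})
```

Building a new merged dictionary on each call leaves the class-level read_args untouched, so one call's overrides do not leak into the next.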

read_args = {'sep': '\t'}

write(df, overwrite_cache=False, write_args=None)

Write df to cache, returning ‘cleaned’ df.

Parameters
  • df (pandas.DataFrame) – DataFrame to be written to disk, using the self.write_args and any override write_args if provided.

  • write_args (dict) – Dictionary of any write_args which will override self.write_args

write_args = {'index': False, 'sep': '\t'}

class reslib.data.cache.ReadWriteArgCopyToDescendants

Bases: type

Make read_args and write_args inherit from the parent class without super() init code. I know about dangerous mutable properties, but doubt it will apply much. This is about usage by research academics, not massively parallel projects. Citation: https://stackoverflow.com/a/42036304/1959876

Example:

class Gramma(metaclass=ReadWriteArgCopyToDescendants):
    read_args = {'sep': '\t'}  # Let's say we just want a read_args at first

class Mom(Gramma):
    read_args = {'parse_dates': ['datadate', ]}

assert Mom().read_args == {'sep': '\t', 'parse_dates': ['datadate', ]}
assert Mom().write_args == {}

class Kid(Mom):
    write_args = {'sep': ','}

assert Kid().read_args == {'sep': '\t', 'parse_dates': ['datadate', ]}
assert Kid().write_args == {'sep': ','}
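
For readers curious how such a metaclass can work, here is a minimal sketch of one way to get the merging behaviour shown in the example. It is an assumption about the general technique, not the source of ReadWriteArgCopyToDescendants: the name MergeArgsMeta and the MERGED_ATTRS constant are hypothetical.

```python
MERGED_ATTRS = ("read_args", "write_args")  # hypothetical list of merged attributes


class MergeArgsMeta(type):
    """Hypothetical metaclass that merges dict attributes down a class hierarchy."""

    def __new__(mcls, name, bases, namespace):
        for attr in MERGED_ATTRS:
            merged = {}
            # Start from the parents' values (earlier bases win on conflicts),
            # then let the class's own definition override them.
            for base in reversed(bases):
                merged.update(getattr(base, attr, {}))
            merged.update(namespace.get(attr, {}))
            # Each class gets its own dict object. The merge is shallow, so
            # nested values such as lists are still shared between classes.
            namespace[attr] = merged
        return super().__new__(mcls, name, bases, namespace)


class Gramma(metaclass=MergeArgsMeta):
    read_args = {"sep": "\t"}


class Mom(Gramma):
    read_args = {"parse_dates": ["datadate"]}


class Kid(Mom):
    write_args = {"sep": ","}
```

Because every class receives a freshly built dictionary, adding a key to Kid.read_args does not alter Mom.read_args, which limits (but, given the shallow merge, does not remove) the mutable-attribute risk mentioned above.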

reslib.data.dataframe module

reslib.data.dataframe

This module provides a wrapper around the pandas DataFrame class that adds convenience functions such as Stata-style column indexing.

copyright
  (c) 2019 by Maclean Gaulin.

license

MIT, see LICENSE for more details.

reslib.data.merges module

reslib.data.merges

This module contains code to merge common datasets (e.g. adding permnos to gvkeys).

copyright
  (c) 2019 by Maclean Gaulin.

license

MIT, see LICENSE for more details.

Module contents

reslib.data

This package contains the functionality related to downloading, caching, and loading datasets.

copyright
  (c) 2019 by Maclean Gaulin.

license

MIT, see LICENSE for more details.