nwb_project_analytics.codestats module

Module for computing code statistics using CLOC

class nwb_project_analytics.codestats.GitCodeStats(output_dir: str, git_paths: dict | None = None)

Bases: object

Class with functions to compute code statistics for repos stored on git

The typical use is to

>>> git_code_stats = GitCodeStats(...)
>>> git_code_stats.compute_code_stats(...)
>>> git_code_stats.compute_summary_stats(...)

If results have been computed previously and cached then do:

>>> git_code_stats = GitCodeStats.from_cache(...)
>>> git_code_stats.compute_summary_stats(...)

We can check whether valid cached files exist via GitCodeStats.cached(); the from_cache function will raise a ValueError if the cache does not exist.

Variables:
  • git_paths – Dict of strings with the keys being the name of the tool and the values being the git URL, e.g., ‘https://github.com/NeurodataWithoutBorders/pynwb.git’.

  • output_dir – Path to the directory where outputs are being stored

  • source_dir – Path where the sources of the repos are checked out to (self.output_dir/src)

  • cache_file_cloc – Path to the YAML file for storing cloc statistics (may not exist if results are not cached)

  • cache_file_commits – Path to the YAML file with the commit stats (may not exist if results are not cached)

  • cloc_stats – Dict with the CLOC statistics

  • commit_stats – Dict with the commit statistics.

  • summary_stats – Dict with time-aligned summary statistics for all repos. The values of the dict are pandas.DataFrame objects and the keys are strings with the statistic type, i.e., ‘sizes’, ‘blank’, ‘codes’, ‘comment’, ‘nfiles’

  • contributors – Pandas dataframe with contributors to the various repos determined via get_contributors and merge_contributors. NOTE: During calculation this will include the ‘email’ column with the emails corresponding to the ‘name’ of the user. However, when loading from cache the email column may not be available, as a user may choose not to cache the email column, e.g., due to privacy concerns (even though the information is usually compiled from Git logs of public code repositories)

static cached(output_dir)

Check if a complete cached version of this class exists at output_dir
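The cache check amounts to a file-existence test on the expected cache files. A minimal, self-contained sketch of the idea (the specific file names below are illustrative assumptions, not the names GitCodeStats actually uses):

```python
import os

def cache_is_complete(output_dir: str) -> bool:
    """Return True if all expected cache files exist in output_dir.

    The file names below are illustrative assumptions; the real
    GitCodeStats implementation defines its own cache file paths.
    """
    expected = ["cloc_stats.yaml", "commit_stats.yaml"]
    return all(os.path.exists(os.path.join(output_dir, f)) for f in expected)
```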

static clean_outdirs(output_dir, source_dir)

Delete the output directory and all its contents and create a new clean directory. Create a new source_dir.

Parameters:
  • output_dir – Output directory for caching results

  • source_dir – Directory for storing repos checked out from git

Returns:

A tuple of two strings with the output_dir and source_dir for git sources

static clone_repos(repos, source_dir)

Clone all of the given repositories.

Parameters:
  • repos – Dict where the keys are the names of the repos and the values are the git source path to clone

  • source_dir – Directory where all the git repos should be cloned to. Each repo will be cloned into a subdirectory in source_dir that is named after the corresponding key in the repos dict.

Returns:

Dict where the keys are the same as in repos but the values are instances of git.repo.base.Repo pointing to the corresponding git repository.

compute_code_stats(cloc_path: str, clean_source_dir: bool = False, contributor_params: dict | None = None)

Compute code statistics using CLOC.

NOTE: Repos will be checked out from GitHub and CLOC computed for all commits, i.e., the repo will be checked out at every commit and CLOC will be run. This process can be very expensive. Using the cache is recommended when possible.

WARNING: This function calls self.clean_outdirs. Any previously cached results will be lost!

Parameters:
  • cloc_path – Path to the cloc command for running cloc stats

  • clean_source_dir – Bool indicating whether to remove self.source_dir when finished

  • contributor_params – dict of strings indicating additional command line parameters to pass to git shortlog, e.g., --since="3 years". Similarly we may specify --since, --after, --before, and --until.

Returns:

None. The function initializes self.commit_stats, self.cloc_stats, and self.contributors

compute_language_stats(ignore_lang=None)

Compute for each code the breakdown in lines-of-code per language (including blank, comment, and code lines for each language).

The index of the resulting dataframe will typically be different for each code as changes occurred on different dates. The index reflects dates on which code changes occurred.

Parameters:

ignore_lang – List of languages to ignore. Usually [‘SUM’, ‘header’] are useful to ignore.

Returns:

Dictionary of pandas.DataFrame objects with the language stats for the different repos

compute_summary_stats(date_range)

Compile summary of line-of-code (LOC) across repos by categories: sizes, blanks, codes, comments, and nfiles

The goal is to align and expand results from all repos so that we can plot them together. Here we create a continuous date range and expand the results from all repos to align with our common time axis. For dates where no new CLOC stats are recorded for a repo, the statistics from the previous time are carried forward to fill in the gaps.

Parameters:

date_range (pandas.date_range) – Pandas datarange object for which the stats should be computed

Returns:

Dict where the values are Pandas DataFrame objects with summary statistics and the keys are strings with the statistic type, i.e., ‘sizes’, ‘blank’, ‘codes’, ‘comment’, ‘nfiles’
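The carry-forward alignment described above corresponds to a pandas reindex with forward fill. A self-contained sketch of that mechanic (an illustration of the strategy, not the actual implementation):

```python
import pandas as pd

# Per-repo LOC values, recorded only on days when commits occurred
repo_loc = pd.Series(
    [100, 150, 180],
    index=pd.to_datetime(["2021-01-01", "2021-01-04", "2021-01-06"]),
)

# Common, continuous time axis shared by all repos
date_range = pd.date_range("2021-01-01", "2021-01-07", freq="D")

# Expand to the common axis; days without new CLOC stats carry
# the statistics from the previous time forward
aligned = repo_loc.reindex(date_range).ffill()
```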

static from_cache(output_dir)

Create a GitCodeStats object from cached results

Parameters:

output_dir – The output directory where the cache files are stored

Returns:

A new GitCodeStats object with the results loaded from the cache

static from_nwb(cache_dir: str, cloc_path: str, start_date: datetime | None = None, end_date: datetime | None = None, read_cache: bool = True, write_cache: bool = True, cache_contributor_emails: bool = False, clean_source_dir: bool = True)

Convenience function to compute GitCodeStats statistics for all NWB git repositories defined by GitRepos.merge(NWBGitInfo.GIT_REPOS, NWBGitInfo.NWB1_GIT_REPOS)

For HDMF and the NDX_Extension_Smithy code statistics before the start date of the repo are set to 0. HDMF was extracted from PyNWB and as such, while there is code history before the official date for HDMF that part is history of PyNWB and so we set those values to 0. Similarly, the NDX_Extension_Smithy also originated from another code.

Parameters:
  • cache_dir – Path to the director where the files with the cached results are stored or should be written to

  • cloc_path – Path to the cloc shell command

  • start_date – Start date from which to compute summary statistics from. If set to None, then use NWBGitInfo.NWB2_START_DATE

  • end_date – End date until which to compute summary statistics to. If set to None, then use datetime.today()

  • read_cache – Bool indicating whether results should be loaded from cache if cached files exists at cache_dir. NOTE: If read_cache is True and files are in the cache then the results will be loaded without checking results (e.g., whether results in the cache are complete and up-to-date).

  • write_cache – Bool indicating whether to write the results to the cache.

  • cache_contributor_emails – Save the emails of contributors in the cached TSV file

  • clean_source_dir – Bool indicating whether to remove self.source_dir when finished computing the code stats. This argument only takes effect when code statistics are computed (i.e., not when data is loaded from cache)

Returns:

Tuple with: 1) GitCodeStats object with all NWB code statistics, 2) dict with the results from GitCodeStats.compute_summary_stats, 3) dict with language statistics computed via GitCodeStats.compute_language_stats, 4) list of all languages used, and 5) dict with the release date and timeline statistics

static get_contributors(repo: Repo | str, contributor_params: str | None = None)

Compute list of contributors for the given repo using git shortlog --summary --numbered --email

Parameters:
  • repo – The git repository to process

  • contributor_params – String indicating additional command line parameters to pass to git shortlog, e.g., --since="3 years". Similarly we may specify --since, --after, --before, and --until.

Returns:

Pandas dataframe with the name, email, and number of contributions to the repo
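The underlying git shortlog output is plain text with one contributor per line, which can be parsed with a simple regular expression. A hedged sketch of that parsing step (the real method may differ in details):

```python
import re

def parse_shortlog(text: str):
    """Parse 'git shortlog --summary --numbered --email' output into
    (count, name, email) tuples.

    Each line looks like: '   42\tJane Doe <jane@example.com>'
    """
    pattern = re.compile(r"^\s*(\d+)\t(.+?) <(.*)>$")
    results = []
    for line in text.splitlines():
        match = pattern.match(line)
        if match:
            count, name, email = match.groups()
            results.append((int(count), name, email))
    return results

sample = "   42\tJane Doe <jane@example.com>\n    7\tJohn Q <jq@example.org>"
contributors = parse_shortlog(sample)
```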

get_languages_used(ignore_lang=None)

Get the list of languages used in the repos

Parameters:

ignore_lang – List of strings with the languages that should be ignored. (Default=None)

Returns:

array of strings with the unique set of languages used

static git_repo_stats(repo: Repo, cloc_path: str, output_dir: str)

Compute cloc statistics for the given repo.

Run cloc only for the last commit on each day to avoid excessive runs

Parameters:
  • repo – The git repository to process

  • cloc_path – Path to run cloc on the command line

  • output_dir – Path to the directory where outputs are being stored

Returns:

The function returns 2 elements, commit_stats and cloc_stats. commit_stats is a list of dicts with information about all commits. The list is sorted in time from most current [0] to oldest [-1]. cloc_stats is a list of dicts with CLOC code statistics. CLOC is run only on the last commit on each day to reduce the number of cloc runs and speed up computation.
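Selecting only the last commit per day can be sketched with a plain dictionary keyed on the commit date, where later commits overwrite earlier ones (an illustration of the strategy described above, not the actual code):

```python
from datetime import datetime

def last_commit_per_day(commits):
    """Given (datetime, sha) pairs sorted oldest to newest, keep only
    the last commit seen on each calendar day."""
    per_day = {}
    for ts, sha in commits:
        per_day[ts.date()] = (ts, sha)  # later commits overwrite earlier ones
    return sorted(per_day.values())

commits = [
    (datetime(2021, 5, 1, 9, 0), "a1"),
    (datetime(2021, 5, 1, 17, 30), "a2"),  # last commit on May 1
    (datetime(2021, 5, 2, 8, 15), "b1"),
]
selected = last_commit_per_day(commits)
```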

static merge_contributors(data_frames: dict, merge_duplicates: bool = True)

Take dict of dataframes generated by GitCodeStats.get_contributors and merge them into a single dataframe

Parameters:
  • data_frames – Dict of dataframes where the keys are the names of repos

  • merge_duplicates – Attempt to detect and merge duplicate contributors by name and email

Returns:

Combined pandas dataframe
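Conceptually, the merge stacks the per-repo tables and then collapses duplicate contributors. A self-contained sketch of that idea using pandas (the column names and aggregation here are assumptions for illustration, not the method's actual behavior):

```python
import pandas as pd

# Per-repo contributor tables, as get_contributors might return them
# (column names are assumptions for illustration)
pynwb = pd.DataFrame(
    {"name": ["Jane Doe"], "email": ["jane@example.com"], "contributions": [42]}
)
hdmf = pd.DataFrame(
    {"name": ["Jane Doe"], "email": ["jane@example.com"], "contributions": [10]}
)

# Stack the tables, then merge duplicate contributors by email,
# summing their contribution counts across repos
combined = (
    pd.concat({"pynwb": pynwb, "hdmf": hdmf})
    .groupby("email", as_index=False)
    .agg({"name": "first", "contributions": "sum"})
)
```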

static run_cloc(cloc_path, src_dir, out_file)

Run CLOC on the given src_dir, save the results to out_file, and return the parsed results.

write_to_cache(cache_contributor_emails: bool = False)

Save the stats to YAML and contributors to TSV.

Results will be saved to the self.cache_file_cloc, self.cache_file_commits, self.cache_git_paths, and self.cache_contributors paths

Parameters:

cache_contributor_emails – Save the emails of contributors in the cached TSV file