Git in DataFrames


python + git + dataframe = git-pandas

from jyquickhelper import add_notebook_menu
add_notebook_menu()
%matplotlib inline

Repository

from gitpandas import Repository
tries = [".", "../..", "../../.."]
err = None
for t in tries:
    try:
        repo = Repository(working_dir=t, verbose=True)
        err = None
        break
    except Exception as e:
        err = e
        continue
if err is not None:
    import os
    raise Exception("issue in current folder '{0}'".format(os.getcwd())) from err
Repository [pyquickhelper] instantiated at directory: ../..
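The loop above tries a fixed list of relative paths. A more general alternative, sketched here as an assumption and not as part of git-pandas, walks up the parent folders until one contains a `.git` entry. The helper name `find_repo_root` is hypothetical, and the demo runs on a temporary folder so it does not depend on the current directory.

```python
import tempfile
from pathlib import Path


def find_repo_root(start="."):
    """Return the closest parent folder holding a .git entry, or None."""
    path = Path(start).resolve()
    for candidate in (path, *path.parents):
        if (candidate / ".git").exists():
            return candidate
    return None


# Demo on a throwaway folder tree: <tmp>/.git and <tmp>/a/b
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    (root / ".git").mkdir()
    nested = root / "a" / "b"
    nested.mkdir(parents=True)
    found = find_repo_root(nested)
    found_matches_root = found == root
```

The returned path could then be given to `Repository(working_dir=...)` in place of the hard-coded candidates.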
repo.branches()
   branch  local     repository
0  master   True  pyquickhelper
1  master  False  pyquickhelper

One amusing function: hours_estimate estimates the number of hours each committer spent on the repository. It is unclear how accurate the estimate is.

try:
    use = repo.hours_estimate()
except Exception as e:
    # Not always reliable.
    print(e)
    use = None
use
c:\python370_x64\lib\site-packages\gitpandas\repository.py:461: UserWarning: Warning, extensions and ignore_dir will be deprecated in v2.0.0, please use ignore_globs instead
  warnings.warn('Warning, extensions and ignore_dir will be deprecated in v2.0.0, please use ignore_globs instead')
                committer       hours
0                sdpython  107.178611
1                 ped4747    0.000000
2  azure provisioned user    2.483333
3                   dupre  515.906667
4           Ensaegithubxd    0.000000
5            xavier dupré  568.081667
if use is not None:
    workdays = use.hours.sum() / 8
else:
    workdays = None
workdays
149.20628472222222

Not sure what this number reflects: it is only the estimated hours of all committers divided into 8-hour workdays.
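To get a feeling for what such a number may mean, here is a minimal sketch of a session-based heuristic of the kind popularized by tools such as git-hours. It is an assumption about the general idea, not the actual implementation of git-pandas: commits separated by less than `grouping_window` hours are counted as one working session, and every session receives a fixed bonus of `single_commit_hours`. The function name `estimate_hours` and both default values are chosen for illustration.

```python
import pandas as pd


def estimate_hours(timestamps, grouping_window=0.5, single_commit_hours=0.5):
    """Rough estimate of hours worked from commit timestamps (illustrative heuristic)."""
    ts = pd.Series(pd.to_datetime(list(timestamps))).sort_values()
    if ts.empty:
        return 0.0
    # Gaps in hours between consecutive commits (the first value is NaT and is dropped).
    gaps = ts.diff().dt.total_seconds().div(3600).iloc[1:]
    # Time spent inside sessions: only short gaps are counted.
    within = gaps[gaps <= grouping_window].sum()
    # Every long gap starts a new session.
    n_sessions = 1 + int((gaps > grouping_window).sum())
    return float(within + n_sessions * single_commit_hours)


# Four commit dates taken from the history shown in this notebook.
commits = [
    "2019-01-10 01:39:10",
    "2019-01-10 01:14:10",
    "2019-01-10 01:02:37",
    "2019-01-09 23:11:41",
]
estimated = estimate_hours(commits)
print(estimated)
```

With this kind of heuristic the result depends heavily on the two thresholds, which is one reason to read the workdays figure as an order of magnitude only.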

Logs

The following can take some time depending on the size of your repository.

try:
    hist = repo.commit_history()
except Exception as e:
    # Not always reliable.
    print(e)
    import pandas
    hist = pandas.DataFrame()
hist.head()
author committer message lines insertions deletions net
date
2019-01-10 01:39:10 xavier dupré xavier dupré Update azure-pipelines.yml\n 9 6 3 3
2019-01-10 01:14:10 xavier dupré xavier dupré Update azure-pipelines.yml\n 2 1 1 0
2019-01-10 01:02:37 xavier dupré xavier dupré disable one test one azure\n 3 2 1 1
2019-01-09 23:11:41 xavier dupré xavier dupré Improves negative pattern in enumerate_noteboo... 3 2 1 1
2019-01-09 02:05:11 xavier dupré xavier dupré Fixes #228, fix missing jpg images in document... 9 5 4 1
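Since the history is indexed by date, it can be aggregated per day with plain pandas. The sketch below rebuilds the five rows displayed above by hand (the name `hist_sample` is only for illustration) so that it runs without a repository; on the real `hist` the same `groupby` should apply, assuming its index is a `DatetimeIndex`.

```python
import pandas as pd

# The five rows displayed above, numeric columns only.
hist_sample = pd.DataFrame(
    {
        "insertions": [6, 1, 2, 2, 5],
        "deletions": [3, 1, 1, 1, 4],
        "net": [3, 0, 1, 1, 1],
    },
    index=pd.to_datetime(
        [
            "2019-01-10 01:39:10",
            "2019-01-10 01:14:10",
            "2019-01-10 01:02:37",
            "2019-01-09 23:11:41",
            "2019-01-09 02:05:11",
        ]
    ).rename("date"),
)

# normalize() sets every timestamp to midnight, which groups commits by day.
per_day = hist_sample.groupby(hist_sample.index.normalize()).sum()
print(per_day)
```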
try:
    histf = repo.file_change_history()
except Exception as e:
    # Not always reliable.
    print(e)
    import pandas
    histf = pandas.DataFrame(dict(filename=[""]))
histf.head()
author committer message rev filename insertions deletions
date
2019-01-10 01:39:10 xavier dupré xavier dupré Update azure-pipelines.yml\n 6e678cc37392708461276f828c339479fb3f8453 azure-pipelines.yml 6 3
2019-01-10 01:14:10 xavier dupré xavier dupré Update azure-pipelines.yml\n 597247297c51427087718dcaf5c86f25ec2fde43 azure-pipelines.yml 1 1
2019-01-10 01:02:37 xavier dupré xavier dupré disable one test one azure\n 44a0cb07ab107acb0354b26a4ecf4fad0b22fb31 _unittests/ut_imghelper/test_js2image.py 2 1
2019-01-09 23:11:41 xavier dupré xavier dupré Improves negative pattern in enumerate_noteboo... f834e60c1e6d2840ef9bb7d5c0f9b8fc0b26ef9f src/pyquickhelper/helpgen/stat_helper.py 2 1
2019-01-09 02:05:11 xavier dupré xavier dupré Fixes #228, fix missing jpg images in document... 25988bee7127bba30eac3d2d9ed26053a5c177f0 src/pyquickhelper/helpgen/utils_sphinx_doc.py 5 4
histf.tail()
author committer message rev filename insertions deletions
date
2013-12-28 02:32:34 sdpython sdpython first version, doc, unit test, setup\n b94d02d8e4bc124e7203c936cdde9570b8392d59 src/pyquickhelper/sync/file_tree_node.py 399 0
2013-12-28 02:32:34 sdpython sdpython first version, doc, unit test, setup\n b94d02d8e4bc124e7203c936cdde9570b8392d59 src/pyquickhelper/sync/synchelper.py 256 0
2013-12-28 02:32:34 sdpython sdpython first version, doc, unit test, setup\n b94d02d8e4bc124e7203c936cdde9570b8392d59 src/pyquickhelper/unittests/__init__.py 0 0
2013-12-28 02:32:34 sdpython sdpython first version, doc, unit test, setup\n b94d02d8e4bc124e7203c936cdde9570b8392d59 src/pyquickhelper/unittests/utils_tests.py 280 0
2013-12-28 02:32:34 sdpython sdpython first version, doc, unit test, setup\n b94d02d8e4bc124e7203c936cdde9570b8392d59 src/version.txt 1 0

Check removed files

unique = set(histf.filename)
len(unique)
1108
import os
sorted_unique = list(sorted(unique))
full_path = [os.path.join(repo.repo.working_dir, _) for _ in sorted_unique]
import numpy
exists = [os.path.exists(f) for f in full_path]
sizes = [os.stat(f).st_size if os.path.exists(f) else numpy.nan for f in full_path]
import pandas
removed = pandas.DataFrame(dict(name=sorted_unique, exists=exists, size=sizes))
removed.sort_values("size").dropna().tail()
name exists size
304 _unittests/ut_helpgen/notebooks_slides/js_boke... True 424724.0
999 src/pyquickhelper/sphinxext/revealjs/templates... True 476557.0
518 _unittests/ut_pycode/data/coverage/coverage_re... True 507382.0
611 _unittests/ut_sphinxext/data/video/mur.mp4 True 625634.0
232 _unittests/ut_helpgen/data/completion_profilin... True 1402712.0

How many files no longer exist?

removed[~removed.exists].shape
(216, 3)
removed[~removed.exists].head()
name exists size
0 "_unittests/ut_sync/data/bug/bugged/Pr\303\251... False NaN
1 "_unittests/ut_sync/data/bug/bugged/Pr\303\251... False NaN
2 .circle.yml False NaN
3 .circle.yml => circle.yml False NaN
5 .coveragerc False NaN
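Some of the missing names, such as `.circle.yml => circle.yml`, look like rename records rather than real paths, which would explain why they do not exist on disk. The helper below is a sketch under that assumption: it splits such a record into the old and the new path, and also handles the brace form (`src/{old => new}/file.py`) that git uses when only part of the path changes. The name `split_rename` is hypothetical.

```python
import re

# Matches the "{old => new}" part git writes when only a path segment changes.
_BRACE = re.compile(r"\{(.*?) => (.*?)\}")


def split_rename(name):
    """Return (old_path, new_path); both equal name when there is no rename marker."""
    if " => " not in name:
        return name, name
    if _BRACE.search(name):
        old = _BRACE.sub(lambda m: m.group(1), name)
        new = _BRACE.sub(lambda m: m.group(2), name)
        # An empty side inside the braces leaves a doubled slash behind.
        return old.replace("//", "/"), new.replace("//", "/")
    old, new = name.split(" => ", 1)
    return old, new


print(split_rename(".circle.yml => circle.yml"))
print(split_rename("src/{a => b}/f.py"))
```

Checking `os.path.exists` on the second element of the pair would avoid counting a renamed file as removed.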
g = repo.repo.git()
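# The command is passed as a list so that it also runs on POSIX systems,
# where a single string is not split into arguments.
print(g.execute(["git", "log", "--log-size", "--abbrev", "--follow", "build_script.bat"]))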
commit 7b34578848f96fea15c1fc96c1cece354f87ba90
log size 115
Author: xavier dupré <xavier.dupre@gmail.com>
Date:   Sun Jul 22 12:36:29 2018 +0200
    update build_script.bat
commit 27b203dd00658808925ab0ed24e91e7f34609254
log size 122
Author: xavier dupré <xavier.dupre@gmail.com>
Date:   Thu May 17 00:45:25 2018 +0200
    documentation, setup, circleci
commit 56950d8a1d41163c72e8cb7f1aac45dc2d4e5789
log size 112
Author: xavier dupré <xavier.dupre@gmail.com>
Date:   Mon Jan 8 13:22:00 2018 +0100
    fix failing unit test
commit 69c1a226c2f5724102ca2e5f22a1d6a126e701be
log size 104
Author: xavier dupré <xavier.dupre@gmail.com>
Date:   Fri Dec 1 00:50:58 2017 +0100
    fix appeveyor
commit f6e286c70b44bac822a51eb38cf48f4c387947dc
log size 105
Author: xavier dupré <xavier.dupre@gmail.com>
Date:   Wed Nov 29 12:25:19 2017 +0100
    update script
commit eba47ba0e8fb6cd7b10f62628a6ae71731b7418f
log size 105
Author: xavier dupré <xavier.dupre@gmail.com>
Date:   Sat Mar 11 22:24:45 2017 +0100
    update python
commit 0ca2465e8c2f68762442df1920fbc6300ef09a60
log size 102
Author: dupre <xavier.dupre@ensae.fr>
Date:   Mon Apr 25 21:18:18 2016 +0200
    update build_script
commit d0dd08cdd12995b4a3e2423b8434bbfcd38d4e66
log size 120
Author: dupre <xavier.dupre@ensae.fr>
Date:   Fri Sep 18 00:50:44 2015 +0200
    revert to python 3.4 build_script.bat
commit 0e2306bf7fc99dc15c6f6d0f8a2abeb74bb6b893
log size 137
Author: dupre <xavier.dupre@ensae.fr>
Date:   Thu Sep 17 22:05:00 2015 +0200
    update to deal with Python 3.5, minimizes dependencies
commit 9e6fd6acd73670a3b2aa4098f33f015f5c1ae8a7
log size 155
Author: dupre <xavier.dupre@ensae.fr>
Date:   Sun May 3 12:49:38 2015 +0200
    add the script automation to the module, it automatically creates scripts
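The raw text returned by `git log` can also be turned into a DataFrame with a few lines of parsing. The sketch below works on a copy of the first two entries printed above (blank lines and message indentation restored as `git log` normally prints them), so it needs no repository. The function name `parse_git_log` and the column names are illustrative choices.

```python
import pandas as pd

LOG_SAMPLE = """commit 7b34578848f96fea15c1fc96c1cece354f87ba90
log size 115
Author: xavier dupré <xavier.dupre@gmail.com>
Date:   Sun Jul 22 12:36:29 2018 +0200

    update build_script.bat

commit 27b203dd00658808925ab0ed24e91e7f34609254
log size 122
Author: xavier dupré <xavier.dupre@gmail.com>
Date:   Thu May 17 00:45:25 2018 +0200

    documentation, setup, circleci
"""


def parse_git_log(text):
    """Parse the default `git log --log-size` layout into one row per commit."""
    records = []
    current = None
    for line in text.splitlines():
        if line.startswith("commit "):
            current = {"commit": line.split()[1], "message": ""}
            records.append(current)
        elif current is None:
            continue
        elif line.startswith("log size"):
            current["log_size"] = int(line.split()[-1])
        elif line.startswith("Author:"):
            current["author"] = line[len("Author:"):].strip()
        elif line.startswith("Date:"):
            current["date"] = line[len("Date:"):].strip()
        elif line.strip():
            # Message lines are indented; join them with a space.
            current["message"] = (current["message"] + " " + line.strip()).strip()
    return pd.DataFrame(records)


log_df = parse_git_log(LOG_SAMPLE)
print(log_df[["commit", "log_size", "message"]])
```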
from pyquickhelper.loghelper.repositories.pygit_helper import get_repo_log
res = get_repo_log(repo.repo.working_dir)
res[0]
['xavier dupré',
 '6e678cc3',
 datetime.datetime(2019, 1, 10, 0, 0),
 'Update azure-pipelines.yml',
 '6e678cc37392708461276f828c339479fb3f8453',
 'https://github.com/sdpython/pyquickhelper/commit/6e678cc37392708461276f828c339479fb3f8453']
df = pandas.DataFrame(data=res, columns="owner hash datetime comment full_hash path".split())
df.head()
owner hash datetime comment full_hash path
0 xavier dupré 6e678cc3 2019-01-10 Update azure-pipelines.yml 6e678cc37392708461276f828c339479fb3f8453 https://github.com/sdpython/pyquickhelper/comm...
1 xavier dupré 59724729 2019-01-10 Update azure-pipelines.yml 597247297c51427087718dcaf5c86f25ec2fde43 https://github.com/sdpython/pyquickhelper/comm...
2 xavier dupré 44a0cb07 2019-01-10 disable one test one azure 44a0cb07ab107acb0354b26a4ecf4fad0b22fb31 https://github.com/sdpython/pyquickhelper/comm...
3 xavier dupré f834e60c 2019-01-09 Improves negative pattern in enumerate_noteboo... f834e60c1e6d2840ef9bb7d5c0f9b8fc0b26ef9f https://github.com/sdpython/pyquickhelper/comm...
4 xavier dupré 25988bee 2019-01-09 Fixes #228, fix missing jpg images in document... 25988bee7127bba30eac3d2d9ed26053a5c177f0 https://github.com/sdpython/pyquickhelper/comm...
res = get_repo_log(repo.repo.working_dir, file_detail=True)
res[0]
('xavier dupré',
 '6e678cc3',
 datetime.datetime(2019, 1, 10, 0, 0),
 'Update azure-pipelines.yml',
 '6e678cc37392708461276f828c339479fb3f8453',
 'https://github.com/sdpython/pyquickhelper/commit/6e678cc37392708461276f828c339479fb3f8453',
 'azure-pipelines.yml',
 9,
 0)
df = pandas.DataFrame(data=res, columns="owner hash datetime comment full_hash path name net bytes".split())
df.head()
owner hash datetime comment full_hash path name net bytes
0 xavier dupré 6e678cc3 2019-01-10 Update azure-pipelines.yml 6e678cc37392708461276f828c339479fb3f8453 https://github.com/sdpython/pyquickhelper/comm... azure-pipelines.yml 9 0
1 xavier dupré 59724729 2019-01-10 Update azure-pipelines.yml 597247297c51427087718dcaf5c86f25ec2fde43 https://github.com/sdpython/pyquickhelper/comm... azure-pipelines.yml 2 0
2 xavier dupré 44a0cb07 2019-01-10 disable one test one azure 44a0cb07ab107acb0354b26a4ecf4fad0b22fb31 https://github.com/sdpython/pyquickhelper/comm... _unittests/ut_imghelper/test_js2image.py 3 0
3 xavier dupré f834e60c 2019-01-09 Improves negative pattern in enumerate_noteboo... f834e60c1e6d2840ef9bb7d5c0f9b8fc0b26ef9f https://github.com/sdpython/pyquickhelper/comm... src/pyquickhelper/helpgen/stat_helper.py 3 0
4 xavier dupré 25988bee 2019-01-09 Fixes #228, fix missing jpg images in document... 25988bee7127bba30eac3d2d9ed26053a5c177f0 https://github.com/sdpython/pyquickhelper/comm... src/pyquickhelper/helpgen/utils_sphinx_doc.py 9 0
df["ext"] = df.name.apply(lambda x: os.path.splitext(x)[-1].strip())
df.head(n=2)
owner hash datetime comment full_hash path name net bytes ext
0 xavier dupré 6e678cc3 2019-01-10 Update azure-pipelines.yml 6e678cc37392708461276f828c339479fb3f8453 https://github.com/sdpython/pyquickhelper/comm... azure-pipelines.yml 9 0 .yml
1 xavier dupré 59724729 2019-01-10 Update azure-pipelines.yml 597247297c51427087718dcaf5c86f25ec2fde43 https://github.com/sdpython/pyquickhelper/comm... azure-pipelines.yml 2 0 .yml
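A quick check of how `os.path.splitext` behaves helps when reading the `ext` column: a name starting with a dot and holding no other dot has an empty extension, and only the last suffix is kept. The first three names below appear in this repository; `archive.tar.gz` is a made-up example.

```python
import os

names = ["azure-pipelines.yml", "src/version.txt", ".coveragerc", "archive.tar.gz"]
extensions = {name: os.path.splitext(name)[-1] for name in names}
for name, ext in extensions.items():
    print(repr(name), "->", repr(ext))
```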
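# Sum only the numeric columns: recent pandas versions no longer drop
# the non-numeric ones silently when aggregating.
gr = df[df.ext.isin((".py", ".ipynb", ".txt", ".zip", ".yml"))].groupby("ext")[["net", "bytes"]].sum()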
gr.T
ext .ipynb .py .txt .yml .zip
net 67490 184556 3604 1632 0
bytes 0 0 0 0 179637
gr.plot(kind="bar");
[Figure: bar chart of net and bytes summed per file extension]