SXXXXXXX_PyUCC/doc/English-manual.md
2025-12-05 11:23:44 +01:00

PyUCC User Manual

Document Version: 1.0
Application: PyUCC (Python Unified Code Counter)


1. Introduction

PyUCC is an advanced static code analysis tool. Its primary objective is to provide quantitative metrics on software development and, crucially, to track code evolution over time through a powerful Differing system.

What is it for?

  1. Counting: Knowing exactly how many lines of code, comments, and blank lines make up your project.
  2. Measuring: Calculating software complexity and maintainability.
  3. Comparing: Understanding exactly what changed between two versions (added/removed/modified files and how complexity has shifted).

2. Core Concepts

Before starting, it is useful to understand the key terms used in the application.

2.1 Baseline

A Baseline is an instant "snapshot" of your project at a specific moment in time.

  • When you create a baseline, PyUCC saves a copy of the files and calculates all metrics.
  • Baselines serve as reference points (benchmarks) for future comparisons.

2.2 Supported Metrics

  • SLOC (Source Lines of Code):
    • Physical Lines: Total lines in the file.
    • Code Lines: Lines containing executable code.
    • Comment Lines: Documentation lines.
    • Blank Lines: Empty lines (often used for formatting).
  • Cyclomatic Complexity (CC): Measures the number of independent paths through the control flow; each branching construct (if, for, while, etc.) adds one. Lower is better.
  • Maintainability Index (MI): An index from 0 to 100 estimating how easy the code is to maintain. Higher is better (above 85 is excellent, below 65 is problematic).
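
To make the four SLOC categories concrete, here is a minimal sketch of line classification for Python-style sources. It is not PyUCC's counter: the function name count_lines is hypothetical, and the sketch assumes that only full-line # comments count as comment lines (docstrings and trailing comments are treated as code).

```python
from collections import Counter


def count_lines(source: str) -> Counter:
    """Classify each physical line as code, comment, or blank.

    Simplifying assumption: only full-line '#' comments are recognized;
    docstrings and trailing comments are counted as code.
    """
    counts = Counter(physical=0, code=0, comment=0, blank=0)
    for line in source.splitlines():
        counts["physical"] += 1
        stripped = line.strip()
        if not stripped:
            counts["blank"] += 1
        elif stripped.startswith("#"):
            counts["comment"] += 1
        else:
            counts["code"] += 1
    return counts


sample = "# add two numbers\n\ndef add(a, b):\n    return a + b  # sum\n"
print(dict(count_lines(sample)))
# prints {'physical': 4, 'code': 2, 'comment': 1, 'blank': 1}
```

A real counter needs per-language rules (block comments, mixed code/comment lines), which is why the numbers it reports can differ from this simplified view.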

2.3 Profile

A Profile is a saved configuration that tells PyUCC:

  • Which folders to analyze.
  • Which languages to include (e.g., Python and C++ only).
  • What to ignore (e.g., venv, build folders, temporary files).
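
Conceptually, a profile is a small record that can be saved and reloaded. The sketch below shows one possible shape; the class and all field names are hypothetical and only mirror the three bullets above, they are not PyUCC's actual storage format.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Profile:
    # Hypothetical field names, one per bullet in the description above.
    name: str
    paths: list[str] = field(default_factory=list)            # folders to analyze
    extensions: list[str] = field(default_factory=list)       # languages to include
    ignore_patterns: list[str] = field(default_factory=list)  # what to skip

    def to_json(self) -> str:
        """Serialize the profile so it can be stored on disk."""
        return json.dumps(asdict(self), indent=2)

    @classmethod
    def from_json(cls, text: str) -> "Profile":
        """Rebuild a profile from its JSON form."""
        return cls(**json.loads(text))


profile = Profile(
    name="My Backend Project",
    paths=["src"],
    extensions=[".py", ".cpp"],
    ignore_patterns=["venv", "build"],
)
restored = Profile.from_json(profile.to_json())
print(restored == profile)  # prints True
```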

3. User Interface (GUI)

The interface is divided into functional zones to keep the workflow organized.

  1. Top Bar:
    • Profile selection.
    • Access to Settings and Profile Manager (Manage).
  2. Actions Bar: The main buttons to start operations (Scan, Countings, Metrics, Differing).
  3. Progress Area: Progress bar and file counter.
  4. Results Table: The large central table where data appears.
  5. Log & Status: At the bottom, a log panel to see what is happening and a status bar monitoring system resources (CPU/RAM).

4. Step-by-Step Guide

4.1 First Run & Profile Configuration

The first thing to do upon opening PyUCC is to define what to analyze.

  1. Click on ⚙️ Manage... in the top bar.
  2. Click on 📝 New to clear the fields.
  3. Enter a Name for the profile (e.g., "My Backend Project").
  4. In the Paths section, use Add Folder to select your code's root directory.
  5. In the Filter Extensions section, select the languages you are interested in (e.g., Python, Java).
  6. In the Ignore patterns box, you can keep the defaults (which already exclude .git, __pycache__, etc.).
  7. Click 💾 Save.

4.2 Simple Analysis (Scan, Countings, Metrics)

If you only want to analyze the current state without comparisons:

  • 🔍 Scan: Simply verifies which files are found based on the profile filters. Useful to check if you are including the right files.
  • 🔢 Countings: Analyzes every file and reports how many code, comment, and blank lines exist.
  • 📊 Metrics: Calculates Cyclomatic Complexity and Maintainability Index for each file.
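
To give an intuition for what the Metrics step measures, the following sketch approximates Cyclomatic Complexity for Python source using the standard ast module. It is an illustration only: PyUCC's own calculation may count different constructs, and the function name approx_cyclomatic_complexity is hypothetical.

```python
import ast

# Node types treated as decision points in this approximation.
_BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)


def approx_cyclomatic_complexity(source: str) -> int:
    """Return 1 plus the number of decision points found in the source."""
    complexity = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, _BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # 'a and b and c' adds two extra paths.
            complexity += len(node.values) - 1
    return complexity


code = """
def classify(n):
    if n < 0:
        return "negative"
    for d in (2, 3):
        if n % d == 0 and n > d:
            return "composite"
    return "other"
"""
print(approx_cyclomatic_complexity(code))  # prints 5
```

The sample scores 5: one base path, two if statements, one for loop, and one and operator.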

Tip: You can double-click on a file in the results table to open it in the built-in File Viewer, which provides syntax highlighting and a colored minimap (blue=code, green=comments).

4.3 The "Differing" Workflow (Comparison)

This is PyUCC's most powerful feature.

Step A: Create the First Baseline

  1. Select your profile.
  2. Click on 🔀 Differing.
  3. If this is the first time you have analyzed this project, PyUCC will notify you: "No baseline found".
  4. Confirm creation. PyUCC will take a "snapshot" of the project (Baseline).

Step B: Work on the Code

Now you can close PyUCC and work on your code (modify files, add new ones, delete old ones).

Step C: Compare

  1. Reopen PyUCC and select the same profile.
  2. Click on 🔀 Differing.
  3. This time, PyUCC detects the existing Baseline and, if you have several, asks which one to compare against.
  4. The result will be a table with specific color coding:
    • Green: Added files or improved metrics.
    • Red: Deleted files or worsened metrics (e.g., increased complexity).
    • Yellow/Orange: Modified files.
    • Δ (Delta) Columns: Show numerical differences (e.g., +50 code lines, -2 complexity).
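
The status and Δ columns can be pictured as a comparison of two metric tables. The sketch below assumes each baseline is reduced to a {path: {metric: value}} mapping. The function name and the 'D' and 'U' status letters are assumptions, and "modified" is decided from the metrics alone here, whereas PyUCC compares file content.

```python
def diff_baselines(old: dict, new: dict) -> dict:
    """Compare two {path: {metric: value}} mappings.

    Returns {path: (status, deltas)} where status is 'A' (added),
    'D' (deleted), 'M' (modified) or 'U' (unchanged).
    Both sides are assumed to use the same metric names.
    """
    result = {}
    for path in sorted(old.keys() | new.keys()):
        if path not in old:
            result[path] = ("A", dict(new[path]))
        elif path not in new:
            result[path] = ("D", {k: -v for k, v in old[path].items()})
        else:
            deltas = {k: new[path][k] - old[path][k] for k in new[path]}
            status = "M" if any(deltas.values()) else "U"
            result[path] = (status, deltas)
    return result


old = {"a.py": {"code_lines": 100, "avg_cc": 4}, "b.py": {"code_lines": 20, "avg_cc": 1}}
new = {"a.py": {"code_lines": 150, "avg_cc": 2}, "c.py": {"code_lines": 10, "avg_cc": 1}}

for path, (status, deltas) in diff_baselines(old, new).items():
    print(path, status, deltas)
# a.py M {'code_lines': 50, 'avg_cc': -2}
# b.py D {'code_lines': -20, 'avg_cc': -1}
# c.py A {'code_lines': 10, 'avg_cc': 1}
```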

Diff Viewer: If you double-click a row in the Differing results, a window will open showing the two files side-by-side, highlighting exactly which lines changed.


5. Example Use Cases

Case 1: Refactoring

  • Goal: You want to clean up code and ensure you haven't increased complexity.
  • Action: Create a Baseline before starting. Perform refactoring. Run Differing.
  • Verification: Check the Δ avg_cc column. If it is negative (e.g., -0.5), great! You reduced complexity. If Δ comment_lines is positive, you improved documentation.

Case 2: Code Review

  • Goal: A colleague added a new feature. What changed?
  • Action: Run Differing against the previous master/main version.
  • Verification: Sort by "Status". Immediately see Added (A) and Modified (M) files. Open the Diff Viewer on modified files to inspect specific lines.

6. Development Philosophy (For Developers)

PyUCC was built following rigorous software engineering principles, which are reflected in its stability and usability.

6.1 Clean Code & PEP8 Standards

The code adheres to the Python PEP8 standard. This ensures that if you ever want to extend the tool or write automation scripts using the core modules, you will find readable, standardized, and predictable code.

6.2 Separation of Concerns (SoC)

The application is strictly divided into two parts:

  1. Core (pyucc.core): Contains pure logic (scanning, metric calculation, diff algorithms). It knows nothing about the GUI.
  2. GUI (pyucc.gui): Handles only visualization and user interaction.

Philosophy: This allows changing the interface without breaking the logic, or using the logic via command line without launching the GUI.

6.3 Non-Blocking UI (Worker Manager)

You may notice the interface never freezes, even when analyzing thousands of files. This is thanks to the WorkerManager. All heavy operations are executed in separate background threads. The GUI receives updates via a thread-safe queue.

  • User Benefit: You can always press "Cancel" if an operation takes too long.
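
The pattern described above (a background thread, a thread-safe queue for updates, and a cancel signal) can be sketched with the standard library alone. This is not the WorkerManager code; run_in_background and the message tuples are hypothetical names chosen for illustration.

```python
import queue
import threading


def run_in_background(items, work, updates: queue.Queue, cancel: threading.Event):
    """Process items on a worker thread, posting progress to a queue."""

    def worker():
        for index, item in enumerate(items, start=1):
            if cancel.is_set():
                # Stop early and report how many items were completed.
                updates.put(("cancelled", index - 1))
                return
            updates.put(("progress", index, work(item)))
        updates.put(("done", len(items)))

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread


updates = queue.Queue()
cancel = threading.Event()
thread = run_in_background(["a.py", "b.py"], len, updates, cancel)
thread.join()  # a GUI would instead poll the queue from its event loop

messages = []
while not updates.empty():
    messages.append(updates.get())
print(messages)
# prints [('progress', 1, 4), ('progress', 2, 4), ('done', 2)]
```

Pressing "Cancel" corresponds to calling cancel.set(): the worker notices the flag before the next item and stops without the interface ever blocking.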

6.4 Intelligent Matching Algorithm (Gale-Shapley)

In Differing, PyUCC doesn't just check if filenames are identical. It uses an algorithm inspired by the "Stable Marriage Problem" (Gale-Shapley) combined with Levenshtein distance on paths.

  • Philosophy: If you move a file from one folder to another, the system attempts to recognize it as the same file moved, rather than marking one as "Deleted" and one as "Added".
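
The idea can be sketched as follows: paths that disappeared "propose" to paths that appeared, closest first by Levenshtein distance, and each new path keeps its closest proposer. This is an assumption-laden illustration, not PyUCC's matcher; the function names and the max_distance cutoff are hypothetical, and a real implementation would likely also compare file content.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]


def match_moved_files(removed: list[str], added: list[str], max_distance: int) -> dict[str, str]:
    """Gale-Shapley: removed paths propose to added paths, closest first."""
    dist = {(r, a): levenshtein(r, a) for r in removed for a in added}
    # Each removed path ranks the acceptable added paths by ascending distance.
    prefs = {
        r: sorted((a for a in added if dist[r, a] <= max_distance),
                  key=lambda a: dist[r, a])
        for r in removed
    }
    next_choice = {r: 0 for r in removed}
    engaged: dict[str, str] = {}  # added path -> removed path
    free = list(removed)
    while free:
        r = free.pop()
        if next_choice[r] >= len(prefs[r]):
            continue  # no candidates left: the file stays "Deleted"
        a = prefs[r][next_choice[r]]
        next_choice[r] += 1
        current = engaged.get(a)
        if current is None:
            engaged[a] = r
        elif dist[r, a] < dist[current, a]:
            engaged[a] = r          # the closer proposer wins
            free.append(current)
        else:
            free.append(r)          # rejected: try the next candidate
    return {r: a for a, r in engaged.items()}


removed = ["old/scanner.py", "old/legacy.py"]
added = ["new/scanner.py", "new/report.py"]
print(match_moved_files(removed, added, max_distance=4))
# prints {'old/scanner.py': 'new/scanner.py'}
```

In this example old/legacy.py has no sufficiently similar partner, so it would be reported as deleted and new/report.py as added.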

6.5 Determinism

The system uses content hashing (SHA1/MD5) to optimize calculations (caching) and to determine if a file has truly changed, ignoring the filesystem modification timestamp if the content remains identical.
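
A minimal sketch of hash-based change detection using Python's hashlib is shown below. The function names are hypothetical, and SHA1 is used only because the text above mentions it.

```python
import hashlib
from pathlib import Path


def content_hash(path: Path, algorithm: str = "sha1", chunk_size: int = 65536) -> str:
    """Hash a file's bytes in chunks so large files do not fill memory."""
    digest = hashlib.new(algorithm)
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def has_changed(path: Path, baseline_hash: str) -> bool:
    """True only if the content differs; timestamps are never consulted."""
    return content_hash(path) != baseline_hash
```

Because only the bytes are compared, touching a file or checking it out again (which updates its modification time) does not mark it as modified.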


7. Troubleshooting Common Issues

  • Program finds no files: Check the Profile Manager to see if the file extension is selected in the language list or if the folder is covered by "Ignore patterns".
  • Extreme slowness: If you included folders with thousands of small non-code files (e.g., node_modules or image assets), add them to "Ignore patterns".
  • Empty Diff Viewer: Ensure the source files still exist on disk. If you deleted the project folder after creating the baseline, the viewer cannot display the current file.

8. New Features (Since v1.0)

This release adds several capabilities that improve code-quality analysis, reproducibility of baselines, and duplicate detection across a codebase. Below is a concise description of what changed and how to use the new features.

8.1 Duplicate Detection (GUI + CLI)

  • What it does: Finds exact and fuzzy duplicates across the project. Exact duplicates are detected by content hashing (SHA1). Fuzzy duplicates use k-gram fingerprinting with a winnowing step to create fingerprints, and a Jaccard similarity score to rank likely duplicates.
  • Parameters: k (k-gram size), window (winnowing window), and threshold (percent similarity). Defaults are chosen for balanced precision/recall but can be adjusted.
  • How to run (GUI): Use the new Duplicates button in the Actions bar (it appears before the Differ button). A dialog lets you choose extensions, the similarity threshold, and fingerprinting parameters. Settings persist between runs.
  • How to run (CLI): Run python -m pyucc duplicates <path> --threshold 5.0 --ext .py .c to print a JSON structure listing the duplicates found.
  • Exports: Results can be exported to CSV and to a UCC-style textual report placed inside baseline folders (when run during baseline creation).
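
To show how k, window, and the similarity score fit together, here is a compact sketch of k-gram fingerprinting with winnowing and a Jaccard score. It is an illustration under stated assumptions, not PyUCC's implementation: the function names are hypothetical, whitespace is simply stripped before hashing, and the values k=5 and window=4 are arbitrary placeholders rather than PyUCC's defaults.

```python
import hashlib


def kgram_hashes(text: str, k: int) -> list[int]:
    """Hash every k-character substring of the whitespace-stripped text."""
    normalized = "".join(text.split())
    return [
        int.from_bytes(hashlib.sha1(normalized[i:i + k].encode()).digest()[:8], "big")
        for i in range(len(normalized) - k + 1)
    ]


def winnow(hashes: list[int], window: int) -> set[int]:
    """Keep the minimum hash of each sliding window as a fingerprint."""
    if len(hashes) < window:
        return set(hashes)
    return {min(hashes[i:i + window]) for i in range(len(hashes) - window + 1)}


def similarity_percent(a: str, b: str, k: int = 5, window: int = 4) -> float:
    """Jaccard similarity of the two fingerprint sets, as a percentage."""
    fa = winnow(kgram_hashes(a, k), window)
    fb = winnow(kgram_hashes(b, k), window)
    if not fa and not fb:
        return 100.0  # two texts too short to fingerprint are treated as equal
    return 100.0 * len(fa & fb) / len(fa | fb)


original = "def add(a, b):\n    return a + b\n"
reformatted = "def add(a, b):\n\treturn a + b\n"
print(similarity_percent(original, reformatted))  # prints 100.0
```

The two samples differ only in whitespace, so they score 100.0. A larger k makes matches stricter, and a larger window keeps fewer fingerprints per file, trading sensitivity for speed.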

8.2 UCC-style Duplicate and Differ Reports

  • Compact UCC-style table: Differ now produces a compact table compatible with UCC-like output, including additional Δ (delta) columns: ΔCode, ΔComm, ΔBlank, ΔFunc, ΔAvgCC, ΔMI. This helps quickly see numeric changes in code, comments, blank lines, number of functions, average cyclomatic complexity and maintainability.
  • Duplicates report: A textual duplicates_report.txt is generated (when requested) that lists duplicate groups with pairwise percent similarity and the parameters used to generate them. Baselines store the parameters so results are reproducible.

Example (compact UCC-style snippet):

File                      Code  Comm  Blank  Func  AvgCC  MI  ΔCode  ΔComm  ΔBlank  ΔFunc  ΔAvgCC  ΔMI
------------------------------------------------------------------------------------------------------
src/module/a.py            120    10      8     5    2.3  78    +10     -1       0     +0    -0.1   +2
src/module/b_copy.py       118     8     10     5    2.4  76     -2     -2      +2      0    +0.1   -2

8.3 Scanner & Baseline Improvements

  • Centralized scanning: The scanner is the canonical provider of the file list. Heavy modules (Differ, Duplicates finder) accept a file_list produced by the scanner to avoid rescanning and to ensure consistent filtering.
  • Ignore pattern normalization: Ignore entries like .bak are normalized to *.bak and matching is case-insensitive by default; this prevents accidental inclusion of backup files in baselines.
  • Baseline reproducibility: Baselines now store the duplicates parameters and the file list snapshot. When a baseline is re-created or analyzed later, PyUCC attempts to re-run per-file function analysis (if lizard is available) so that function-level metrics in older baselines remain useful.
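
The ignore-pattern normalization described above can be sketched with the standard fnmatch module. The function names are hypothetical and the rule shown (prefix a bare extension with *) is an assumption about how such normalization could work, not PyUCC's exact logic.

```python
import fnmatch


def normalize_pattern(pattern: str) -> str:
    """Turn a bare extension such as '.bak' into the glob '*.bak'."""
    pattern = pattern.strip()
    if pattern.startswith(".") and not any(ch in pattern for ch in "*?[/"):
        return "*" + pattern
    return pattern


def is_ignored(name: str, patterns: list[str]) -> bool:
    """Case-insensitive match of a file or folder name against the patterns."""
    lowered = name.lower()
    return any(
        fnmatch.fnmatchcase(lowered, normalize_pattern(p).lower())
        for p in patterns
    )


patterns = [".bak", "__pycache__"]
print(normalize_pattern(".bak"))            # prints *.bak
print(is_ignored("Notes.BAK", patterns))    # prints True
print(is_ignored("main.py", patterns))      # prints False
```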

8.4 Notes on Dependencies

  • Function-level metrics (number of functions, per-function CC) rely on lizard. If lizard is not installed, PyUCC will still produce SLOC and coarse metrics but function details may be missing. Baseline creation records this state and will re-run function analysis if lizard becomes available later.
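
If you script around PyUCC, you can check up front whether function-level metrics will be available. The sketch below only tests whether the lizard package is importable; the helper names and column names are hypothetical.

```python
import importlib.util


def function_metrics_available() -> bool:
    """Report whether the optional 'lizard' package can be imported."""
    return importlib.util.find_spec("lizard") is not None


def metric_columns() -> list[str]:
    """List the columns to expect; function-level ones need lizard."""
    # Column names are illustrative only.
    columns = ["code_lines", "comment_lines", "blank_lines"]
    if function_metrics_available():
        columns += ["func_count", "avg_cc"]
    return columns


print(metric_columns())
```

If the function-level columns are missing, installing lizard (for example with pip install lizard) and re-running the analysis should make them available.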
