SXXXXXXX_PyUCC/doc/English-manual.md
2025-12-12 10:37:51 +01:00

# PyUCC User Manual
**Document Version:** 1.0
**Application:** PyUCC (Python Unified Code Counter)
---
## 1. Introduction
**PyUCC** is an advanced static code analysis tool. Its primary objective is to provide quantitative metrics on software development and, crucially, to track code evolution over time through a powerful **Differing** system.
### What is it for?
1. **Counting:** Knowing exactly how many lines of code, comments, and blank lines make up your project.
2. **Measuring:** Calculating software complexity and maintainability.
3. **Comparing:** Understanding exactly what changed between two versions (added/removed/modified files and how complexity has shifted).
---
## 2. Core Concepts
Before starting, it is useful to understand the key terms used in the application.
### 2.1 Baseline
A **Baseline** is an instant "snapshot" of your project at a specific moment in time.
* When you create a baseline, PyUCC saves a copy of the files and calculates all metrics.
* Baselines serve as reference points (benchmarks) for future comparisons.
### 2.2 Supported Metrics
* **SLOC (Source Lines of Code):**
  * *Physical Lines:* Total lines in the file.
  * *Code Lines:* Lines containing executable code.
  * *Comment Lines:* Documentation lines.
  * *Blank Lines:* Empty lines (often used for formatting).
* **Cyclomatic Complexity (CC):** Measures the complexity of the control flow (how many `if`, `for`, `while` statements, etc.). **Lower is better.**
* **Maintainability Index (MI):** An index from 0 to 100 estimating how easy the code is to maintain. **Higher is better** (above 85 is excellent, below 65 is problematic).
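To make the four SLOC categories concrete, here is a minimal sketch of a line classifier for Python sources. It is not PyUCC's counter: the function name `count_lines` is hypothetical, and the rules are deliberately simplified (only full-line `#` comments count as comments; docstrings and trailing comments are counted as code).

```python
def count_lines(text: str) -> dict:
    """Classify each physical line as code, comment, or blank (simplified rules)."""
    counts = {"physical": 0, "code": 0, "comment": 0, "blank": 0}
    for line in text.splitlines():
        counts["physical"] += 1
        stripped = line.strip()
        if not stripped:
            counts["blank"] += 1
        elif stripped.startswith("#"):
            # Assumption: only full-line comments are counted as comments.
            counts["comment"] += 1
        else:
            counts["code"] += 1
    return counts


sample = "# header\n\nx = 1\nif x:\n    print(x)  # trailing\n"
print(count_lines(sample))
```

A real counter has to handle block comments, docstrings and language-specific syntax, which is why the numbers from this sketch may differ from what PyUCC reports.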
### 2.3 Profile
A **Profile** is a saved configuration that tells PyUCC:
* Which folders to analyze.
* Which languages to include (e.g., Python and C++ only).
* What to ignore (e.g., `venv`, `build` folders, temporary files).
---
## 3. User Interface (GUI)
The interface is divided into functional zones to keep the workflow organized.
1. **Top Bar:**
   * **Profile** selection.
   * Access to **Settings** and Profile Manager (**Manage**).
2. **Actions Bar:** The main buttons to start operations (`Scan`, `Countings`, `Metrics`, `Differing`).
3. **Progress Area:** Progress bar and file counter.
4. **Results Table:** The large central table where data appears.
5. **Log & Status:** At the bottom, a log panel to see what is happening and a status bar monitoring system resources (CPU/RAM).
---
## 4. Configuration and Settings
### 4.1 Configuration Files Location
PyUCC stores all configuration in the application directory (where the executable or source code is located):
- **`profiles.json`** - Stores all your analysis profiles (project paths, filters, ignore patterns)
- **`settings.json`** - Stores application settings (baseline directory, retention policy, duplicates parameters)
**Location:**
- If running from source: in the repository root folder (e.g., `C:\src\PyUcc\`)
- If running from compiled exe: in the same folder as the executable
**Advantages:**
- Fully portable application - copy the entire folder to move your setup
- Easy to backup - just backup the application folder
- No hidden files in user's home directory
### 4.2 Application Settings (`settings.json`)
You can configure PyUCC's behavior through the **⚙️ Settings** menu in the top bar.
**Available Settings:**
1. **Baseline Directory**
   - **What it is:** Where PyUCC stores baseline snapshots.
   - **Default:** `baseline/` subfolder in the application directory.
   - **Recommendation:** Use the default, or set a custom path if you want to store baselines on a network drive or separate disk.
   - **Example:** `D:\ProjectBaselines\` or `\\server\share\baselines\`
2. **Max Baselines to Keep**
   - **What it is:** Maximum number of baseline snapshots to retain per profile.
   - **Default:** 5
   - **Behavior:** When exceeded, PyUCC automatically deletes the oldest baselines.
   - **Recommendation:**
     - 3-5 for small projects
     - 10+ for critical projects requiring long history
     - 20+ if disk space is not a concern
3. **Zip Baselines**
   - **What it is:** Whether to compress baseline snapshots as `.zip` files.
   - **Default:** `false` (disabled)
   - **Advantages when enabled:**
     - Saves disk space (50-80% reduction for source code)
     - Faster to transfer/backup
   - **Disadvantages:**
     - Slightly slower to create/compare (compression overhead)
     - Cannot browse snapshot files directly
   - **Recommendation:** Enable for large projects (>10,000 files) or when disk space is limited.
4. **Duplicates Settings** (stored automatically)
   - Threshold, k-gram size, winnowing window
   - These are saved when you use the Duplicates feature
   - See Section 9 for detailed explanation
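The retention rule ("Max Baselines to Keep") can be pictured with the following sketch. It is an illustration under stated assumptions, not PyUCC's code: the function `prune_baselines` is hypothetical, and it assumes that all snapshots of one profile live in a single directory and that their names embed a sortable timestamp (as in `MyProject__20251205T143022_local`), so that sorting by name yields oldest first.

```python
import shutil
from pathlib import Path


def prune_baselines(profile_dir, max_keep=5):
    """Delete the oldest snapshots so at most max_keep remain; return removed names."""
    snapshots = sorted(
        (p for p in Path(profile_dir).iterdir() if p.is_dir() or p.suffix == ".zip"),
        key=lambda p: p.name,  # assumption: names sort chronologically
    )
    excess = snapshots[: max(0, len(snapshots) - max_keep)]
    for snapshot in excess:
        if snapshot.is_dir():
            shutil.rmtree(snapshot)  # uncompressed snapshot folder
        else:
            snapshot.unlink()  # zipped snapshot
    return [p.name for p in excess]
```

Because old snapshots are deleted automatically, raise the limit before creating a baseline you want to keep for a long time, or copy that baseline elsewhere.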
**How to Configure:**
1. Click **⚙️ Settings** in the top bar.
2. Set your preferred baseline directory (or leave default).
3. Set max baselines to keep.
4. Check "Zip baselines" if desired.
5. Click **Save**.
**First-Time Setup Recommendation:**
At first run, PyUCC will use sensible defaults:
- Baselines stored in `baseline/` subfolder
- Keep last 5 baselines
- No compression
**You should configure Settings if:**
- You want baselines on a different drive (e.g., network storage, external disk)
- You need to keep more baseline history
- You're running out of disk space and want compression
### 4.3 First Run & Profile Configuration
The first thing to do upon opening PyUCC is to define *what* to analyze.
1. Click on **⚙️ Manage...** in the top bar.
2. Click on **📝 New** to clear the fields.
3. Enter a **Name** for the profile (e.g., "My Backend Project").
4. In the **Paths** section, use **Add Folder** to select your code's root directory.
5. In the **Filter Extensions** section, select the languages you are interested in (e.g., Python, Java).
6. In the **Ignore patterns** box, you can keep the defaults (which already exclude `.git`, `__pycache__`, etc.).
7. Click **💾 Save**.
### 4.4 Simple Analysis (Scan, Countings, Metrics)
If you only want to analyze the current state without comparisons:
* **🔍 Scan:** Simply verifies which files are found based on the profile filters. Useful to check if you are including the right files.
* **🔢 Countings:** Analyzes every file and reports how many code, comment, and blank lines exist.
* **📊 Metrics:** Calculates Cyclomatic Complexity and Maintainability Index for each file.
> **Tip:** You can double-click on a file in the results table to open it in the built-in **File Viewer**, which provides syntax highlighting and a colored minimap (blue=code, green=comments).
### 4.5 The "Differing" Workflow (Comparison)
This is PyUCC's most powerful feature.
**Step A: Create the First Baseline**
1. Select your profile.
2. Click on **🔀 Differing**.
3. If this is the first time you analyze this project, PyUCC will notify you: *"No baseline found"*.
4. Confirm creation. PyUCC will take a "snapshot" of the project (Baseline).
**Step B: Work on the Code**
Now you can close PyUCC and work on your code (modify files, add new ones, delete old ones).
**Step C: Compare**
1. Reopen PyUCC and select the same profile.
2. Click on **🔀 Differing**.
3. This time, PyUCC detects an existing previous Baseline and asks which one to compare against (if you have multiple).
4. The result will be a table with specific color coding:
   * **Green:** Added files or improved metrics.
   * **Red:** Deleted files or worsened metrics (e.g., increased complexity).
   * **Yellow/Orange:** Modified files.
   * **Δ (Delta) Columns:** Show numerical differences (e.g., `+50` code lines, `-2` complexity).
> **Diff Viewer:** If you double-click a row in the Differing results, a window will open showing the two files side-by-side, highlighting exactly which lines changed.
---
## 5. Example Use Cases
### Case 1: Refactoring
* **Goal:** You want to clean up code and ensure you haven't increased complexity.
* **Action:** Create a Baseline before starting. Perform refactoring. Run *Differing*.
* **Verification:** Check the **Δ avg_cc** column. If it is negative (e.g., `-0.5`), great! You reduced complexity. If **Δ comment_lines** is positive, you improved documentation.
### Case 2: Code Review
* **Goal:** A colleague added a new feature. What changed?
* **Action:** Run *Differing* against the previous master/main version.
* **Verification:** Sort by "Status". Immediately see **Added** (A) and **Modified** (M) files. Open the Diff Viewer on modified files to inspect specific lines.
---
## 6. Development Philosophy (For Developers)
PyUCC was built following rigorous software engineering principles, which are reflected in its stability and usability.
### 6.1 Clean Code & PEP8 Standards
The code adheres to the Python **PEP8** standard. This ensures that if you ever want to extend the tool or write automation scripts using the `core` modules, you will find readable, standardized, and predictable code.
### 6.2 Separation of Concerns (SoC)
The application is strictly divided into two parts:
1. **Core (`pyucc.core`):** Contains pure logic (scanning, metric calculation, diff algorithms). It knows nothing about the GUI.
2. **GUI (`pyucc.gui`):** Handles only visualization and user interaction.
**Philosophy:** This allows changing the interface without breaking the logic, or using the logic via command line without launching the GUI.
### 6.3 Non-Blocking UI (Worker Manager)
You may notice the interface never freezes, even when analyzing thousands of files.
This is thanks to the **WorkerManager**. All heavy operations are executed in separate background threads. The GUI receives updates via a thread-safe `queue`.
* **User Benefit:** You can always press "Cancel" if an operation takes too long.
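The pattern described here (background thread, thread-safe queue, cancellable work) can be sketched in a few lines. This is a generic illustration, not the actual `WorkerManager`: the function `run_in_background` and the message tuples are hypothetical.

```python
import queue
import threading


def run_in_background(items, work, updates, cancel):
    """Process items in a worker thread and post progress messages to a queue."""
    def worker():
        for index, item in enumerate(items, start=1):
            if cancel.is_set():
                # The user pressed "Cancel": stop before starting the next item.
                updates.put(("cancelled", index - 1))
                return
            updates.put(("progress", index, work(item)))
        updates.put(("done", len(items)))

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread


updates = queue.Queue()
cancel = threading.Event()
run_in_background(["a.py", "b.py"], len, updates, cancel).join()
while not updates.empty():
    print(updates.get())
```

In a GUI, the main thread would not call `join()`. It would poll the queue from a periodic timer with `get_nowait()` and update the progress bar, which is what keeps the window responsive.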
### 6.4 Intelligent Matching Algorithm (Gale-Shapley)
In *Differing*, PyUCC doesn't just check if filenames are identical. It uses an algorithm inspired by the "Stable Marriage Problem" (Gale-Shapley) combined with Levenshtein distance on paths.
* **Philosophy:** If you move a file from one folder to another, the system attempts to recognize it as the *same* file moved, rather than marking one as "Deleted" and one as "Added".
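As a rough illustration of the idea, the sketch below pairs old and new paths with the Gale-Shapley procedure, using Levenshtein distance between path strings as the preference order. It is a simplified sketch, not PyUCC's matcher: the function names are hypothetical, and a real implementation would presumably also apply a distance cutoff so that unrelated files are reported as deleted and added rather than paired.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]


def match_paths(old_paths, new_paths):
    """Pair old and new paths with Gale-Shapley; old paths propose to new ones."""
    dist = {(o, n): levenshtein(o, n) for o in old_paths for n in new_paths}
    # Each old path ranks the new paths from closest to farthest.
    prefs = {o: sorted(new_paths, key=lambda n: (dist[o, n], n)) for o in old_paths}
    next_choice = {o: 0 for o in old_paths}
    engaged = {}  # new path -> old path currently holding it
    free = list(old_paths)
    while free:
        o = free.pop()
        if next_choice[o] >= len(prefs[o]):
            continue  # no candidates left: this old path stays unmatched
        n = prefs[o][next_choice[o]]
        next_choice[o] += 1
        rival = engaged.get(n)
        if rival is None:
            engaged[n] = o
        elif (dist[o, n], o) < (dist[rival, n], rival):
            engaged[n] = o  # the new path prefers the closer proposer
            free.append(rival)
        else:
            free.append(o)
    return {o: n for n, o in engaged.items()}


print(match_paths(["src/a.py", "src/utils/helper.py"],
                  ["src/a.py", "lib/utils/helper.py"]))
```

Old paths left without a partner correspond to deleted files, and new paths nobody holds correspond to added files.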
### 6.5 Determinism
The system uses content hashing (SHA1/MD5) to optimize calculations (caching) and to determine if a file has *truly* changed, ignoring the filesystem modification timestamp if the content remains identical.
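A minimal sketch of hash-based change detection follows. The helper names are hypothetical, and the sketch uses SHA1 only; it shows how comparing digests yields the A/D/M/U statuses regardless of file timestamps.

```python
import hashlib


def sha1_of(data: bytes) -> str:
    """Hex SHA1 digest of raw file content."""
    return hashlib.sha1(data).hexdigest()


def classify_changes(baseline: dict, current: dict) -> dict:
    """Map each path to A, D, M or U by comparing content hashes only."""
    status = {}
    for path in sorted(set(baseline) | set(current)):
        if path not in baseline:
            status[path] = "A"  # present only now: added
        elif path not in current:
            status[path] = "D"  # present only in the baseline: deleted
        elif baseline[path] == current[path]:
            status[path] = "U"  # same digest: unchanged, whatever the mtime says
        else:
            status[path] = "M"  # digest differs: modified
    return status
```

Both arguments map a relative path to its digest, so a file that was merely touched (new timestamp, same bytes) is reported as unchanged.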
---
## 7. Troubleshooting Common Issues
* **Program finds no files:** Check the Profile Manager to see if the file extension is selected in the language list or if the folder is covered by "Ignore patterns".
* **Extreme slowness:** If you included folders with thousands of small non-code files (e.g., `node_modules` or image assets), add them to "Ignore patterns".
* **Empty Diff Viewer:** Ensure the source files still exist on disk. If you deleted the project folder after creating the baseline, the viewer cannot display the current file.
---
## 8. New Features (Since v1.0)
This release adds several capabilities that improve code-quality analysis, reproducibility of baselines, and duplicate detection across a codebase. Below is a concise description of what changed and how to use the new features.
### 8.1 Duplicate Detection (GUI + CLI)
- **What it does:** Finds exact and fuzzy duplicates across the project. Exact duplicates are detected by content hashing (SHA1). Fuzzy duplicates use k-gram fingerprinting with a winnowing step to create fingerprints, and a Jaccard similarity score to rank likely duplicates.
- **Parameters:** `k` (k-gram size), `window` (winnowing window), and `threshold` (percent similarity). Defaults are chosen for balanced precision/recall but can be adjusted.
- **How to run (GUI):** Use the new **Duplicates** button in the Actions bar (it appears before the Differ button). A dialog lets you choose extensions, the similarity threshold, and fingerprinting parameters. Settings persist between runs.
- **How to run (CLI):** `python -m pyucc duplicates <path> --threshold 5.0 --ext .py .c` prints a JSON structure with duplicates found.
- **Exports:** Results can be exported to CSV and to a UCC-style textual report placed inside baseline folders (when run during baseline creation).
### 8.2 UCC-style Duplicate and Differ Reports
- **Compact UCC-style table:** Differ now produces a compact table compatible with UCC-like output, including additional Δ (delta) columns: `ΔCode`, `ΔComm`, `ΔBlank`, `ΔFunc`, `ΔAvgCC`, `ΔMI`. This helps quickly see numeric changes in code, comments, blank lines, number of functions, average cyclomatic complexity and maintainability.
- **Duplicates report:** A textual `duplicates_report.txt` is generated (when requested) that lists duplicate groups with pairwise percent similarity and the parameters used to generate them. Baselines store the parameters so results are reproducible.
Example (compact UCC-style snippet):
```
File                    Code  Comm  Blank  Func  AvgCC   MI  ΔCode  ΔComm  ΔBlank  ΔFunc  ΔAvgCC  ΔMI
-----------------------------------------------------------------------------------------------------
src/module/a.py          120    10      8     5    2.3   78    +10     -1       0     +0    -0.1   +2
src/module/b_copy.py     118     8     10     5    2.4   76     -2     -2      +2      0    +0.1   -2
```
### 8.3 Scanner & Baseline Improvements
- **Centralized scanning:** The `scanner` is the canonical provider of the file list. Heavy modules (Differ, Duplicates finder) accept a `file_list` produced by the scanner to avoid rescanning and to ensure consistent filtering.
- **Ignore pattern normalization:** Ignore entries like `.bak` are normalized to `*.bak` and matching is case-insensitive by default; this prevents accidental inclusion of backup files in baselines.
- **Baseline reproducibility:** Baselines now store the duplicates parameters and the file list snapshot. When a baseline is re-created or analyzed later, PyUCC attempts to re-run per-file function analysis (if `lizard` is available) so that function-level metrics in older baselines remain useful.
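The ignore-pattern normalization can be sketched as follows. This is an assumption-laden illustration, not the scanner's code: the function names are hypothetical, and matching each path component separately is a choice made for this sketch.

```python
import fnmatch


def normalize_pattern(pattern: str) -> str:
    """Turn a bare extension such as '.bak' into the glob '*.bak'."""
    pattern = pattern.strip()
    if pattern.startswith(".") and not any(ch in pattern for ch in "*?[/\\"):
        return "*" + pattern
    return pattern


def is_ignored(path: str, patterns) -> bool:
    """Case-insensitive match of any path component against the patterns."""
    parts = path.replace("\\", "/").lower().split("/")
    globs = [normalize_pattern(p).lower() for p in patterns]
    # fnmatchcase avoids platform-dependent case handling; both sides are lowercased.
    return any(fnmatch.fnmatchcase(part, g) for part in parts for g in globs)


print(is_ignored("src/Old/Main.BAK", [".bak"]))
print(is_ignored("src/main.py", [".bak", "venv"]))
```

Note that `*` also matches an empty prefix, so a normalized entry such as `*.git` still matches a folder named `.git`.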
### 8.4 Notes on Dependencies
- Function-level metrics (number of functions, per-function CC) rely on `lizard`. If `lizard` is not installed, PyUCC will still produce SLOC and coarse metrics but function details may be missing. Baseline creation records this state and will re-run function analysis if `lizard` becomes available later.
---
## 9. Duplicate Detection: Algorithms and Technical Details
This section provides a deeper understanding of how PyUCC identifies duplicate code, what the algorithms do, and how to interpret the results.
### 9.1 Exact Duplicate Detection
**How it works:**
- PyUCC normalizes each file (strips leading/trailing whitespace from each line and, optionally, converts the text to lowercase).
- Computes a SHA1 hash of the normalized content.
- Files with identical hashes are considered exact duplicates.
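
The three steps above can be sketched as follows. The function names are hypothetical and the exact normalization PyUCC applies may differ in detail; the sketch only shows the normalize, hash, group sequence.

```python
import hashlib
from collections import defaultdict


def normalized_sha1(text: str, lowercase: bool = False) -> str:
    """Strip each line, optionally lowercase, then hash the result with SHA1."""
    normalized = "\n".join(line.strip() for line in text.splitlines())
    if lowercase:
        normalized = normalized.lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def exact_duplicate_groups(files: dict, lowercase: bool = False) -> list:
    """Group paths whose normalized content has the same hash."""
    groups = defaultdict(list)
    for path, text in files.items():
        groups[normalized_sha1(text, lowercase)].append(path)
    return [sorted(paths) for paths in groups.values() if len(paths) > 1]


files = {
    "utils.py": "def f():\n    return 1\n",
    "utils_backup.py": "def f():\n\treturn 1   \n",
    "other.py": "x = 2\n",
}
print(exact_duplicate_groups(files))
```

Because indentation is stripped before hashing, two files that differ only in whitespace style end up in the same group.
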
**Use case:** Finding files that were copy-pasted with no or minimal changes (e.g., `utils.py` and `utils_backup.py`).
**What you'll see:**
- In the GUI table: pairs of files marked as "exact" duplicates with 100% similarity.
- In the report: listed under "Exact duplicates" section.
### 9.2 Fuzzy Duplicate Detection (Advanced)
Fuzzy detection identifies files that are *similar* but not identical. This is useful for finding:
- Code that was copy-pasted and then slightly modified.
- Refactored modules that share large blocks of logic.
- Experimental branches or "almost-duplicates" that should be merged.
**Algorithm Overview:**
1. **K-gram Hashing (Rolling Hash with Rabin-Karp):**
   - Each file is divided into overlapping sequences of `k` consecutive lines (k-grams).
   - A rolling hash (Rabin-Karp polynomial hash) is computed for each k-gram.
   - This produces a large set of hash values representing all k-grams in the file.
2. **Winnowing (Fingerprint Selection):**
   - To reduce the number of hashes (and improve performance), PyUCC applies a "winnowing" technique.
   - A sliding window of size `w` moves over the hash sequence.
   - In each window, the minimum hash value is selected as a fingerprint.
   - This creates a compact set of representative fingerprints for the file.
   - **Key property:** If two files share a substring of at least `k + w - 1` lines, they will share at least one fingerprint.
3. **Inverted Index:**
   - All fingerprints from all files are stored in an inverted index: `{fingerprint -> [list of files containing it]}`.
   - This allows fast lookup of which files share fingerprints.
4. **Jaccard Similarity:**
   - For each pair of files that share at least one fingerprint, PyUCC computes the Jaccard similarity:
   ```
   Jaccard(A, B) = |A ∩ B| / |A ∪ B|
   ```
   - Where A and B are the sets of fingerprints for the two files.
   - If the Jaccard score is above the threshold (default: 0.85, meaning 85% similarity), the pair is flagged as a fuzzy duplicate.
5. **Percent Change Calculation:**
   - PyUCC also estimates the percentage of lines that differ between the two files.
   - If `pct_change <= threshold` (e.g., ≤5%), the files are considered duplicates.
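
Steps 1 to 4 can be sketched end to end as shown below. This is a simplified illustration, not PyUCC's implementation: all function names are hypothetical, a truncated SHA1 of each k-gram stands in for the rolling Rabin-Karp hash, and the percent-change step is omitted.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def kgram_hashes(lines, k):
    """Hash every run of k consecutive stripped lines."""
    cleaned = [line.strip() for line in lines]
    hashes = []
    for i in range(len(cleaned) - k + 1):
        gram = "\n".join(cleaned[i:i + k]).encode("utf-8")
        # SHA1 stands in for the rolling Rabin-Karp hash to keep the sketch short.
        hashes.append(int(hashlib.sha1(gram).hexdigest()[:12], 16))
    return hashes


def winnow(hashes, window):
    """Keep the minimum hash of each sliding window as a fingerprint."""
    if not hashes:
        return set()
    if len(hashes) < window:
        return {min(hashes)}
    return {min(hashes[i:i + window]) for i in range(len(hashes) - window + 1)}


def jaccard(a, b):
    """|A ∩ B| / |A ∪ B| for two fingerprint sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def fuzzy_pairs(files, k=25, window=4, min_similarity=0.85):
    """Return (path_a, path_b, similarity) for pairs at or above the threshold."""
    prints = {p: winnow(kgram_hashes(t.splitlines(), k), window) for p, t in files.items()}
    index = defaultdict(set)  # inverted index: fingerprint -> paths containing it
    for path, fingerprints in prints.items():
        for fp in fingerprints:
            index[fp].add(path)
    candidates = set()
    for paths in index.values():
        candidates.update(combinations(sorted(paths), 2))
    results = []
    for a, b in sorted(candidates):
        score = jaccard(prints[a], prints[b])
        if score >= min_similarity:
            results.append((a, b, score))
    return results
```

The inverted index is what keeps this practical: only files that share at least one fingerprint are ever compared, instead of every possible pair.
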
**Parameters you can adjust:**
- **`k` (k-gram size):** Number of consecutive lines in each k-gram. Default: 25.
  - Larger `k` → fewer false positives, but may miss small duplicates.
  - Smaller `k` → more sensitive, but may produce false positives.
- **`window` (winnowing window size):** Size of the window for selecting fingerprints. Default: 4.
  - Larger window → fewer fingerprints, faster processing, but may miss some matches.
  - Smaller window → more fingerprints, slower, but more thorough.
- **`threshold` (percent change threshold):** Maximum allowed difference (in %) to still consider two files duplicates. Default: 5.0%.
  - Lower threshold → stricter matching (only very similar files).
  - Higher threshold → more lenient (catches files with more differences).
**Recommended settings:**
| Use Case | k | window | threshold |
|----------|---|--------|----------|
| Strict duplicate finding (only near-identical files) | 30 | 5 | 3.0% |
| Balanced (default) | 25 | 4 | 5.0% |
| Loose matching (catch refactored code) | 20 | 3 | 10.0% |
| Very aggressive (experimental) | 15 | 2 | 15.0% |
### 9.3 Understanding Duplicate Reports
**GUI Table Columns:**
- **File A / File B:** The two files being compared.
- **Match Type:** "exact" or "fuzzy".
- **Similarity (%):** For fuzzy matches, the Jaccard similarity score (0-100%).
- **Pct Change (%):** Estimated percentage of lines that differ.
**Textual Report (duplicates_report.txt):**
The report is divided into two sections:
1. **Exact Duplicates:**
   ```
   Exact duplicates: 3
   src/utils.py <=> src/backup/utils_old.py
   src/module/helper.py <=> src/module/helper - Copy.py
   ```
2. **Fuzzy Duplicates:**
   ```
   Fuzzy duplicates (threshold): 5
   src/processor.py <=> src/processor_v2.py
     Similarity: 92.5% | Pct Change: 3.2%
   src/core/engine.py <=> src/experimental/engine_new.py
     Similarity: 88.0% | Pct Change: 4.8%
   ```
**Interpretation:**
- **High similarity (>95%):** Strong candidates for deduplication. Consider keeping only one version or merging.
- **Medium similarity (85-95%):** Review manually. May indicate refactored code or intentional variations.
- **Threshold violations:** Files that exceed the `pct_change` threshold won't appear in the report, even if they share some fingerprints.
---
## 10. Reading and Interpreting Differ Reports
The Differ functionality produces several types of output. Understanding each helps you track code evolution accurately.
### 10.1 Compact UCC-Style Table
When you run *Differing*, PyUCC generates a compact summary table similar to the original UCC tool:
**Example:**
```
File                    Code  Comm  Blank  Func  AvgCC   MI  ΔCode  ΔComm  ΔBlank  ΔFunc  ΔAvgCC  ΔMI
-----------------------------------------------------------------------------------------------------
src/module/a.py          120    10      8     5    2.3   78    +10     -1       0     +0    -0.1   +2
src/module/b.py          118     8     10     5    2.4   76     -2     -2      +2      0    +0.1   -2
src/new_feature.py        45     5      3     2    1.8   82    +45     +5      +3     +2    +1.8  +82
src/old_code.py           --    --     --    --     --   --    -30     -5      -2     -1    -2.1  -75
```
**Column Meanings:**
| Column | Meaning |
|--------|--------|
| **File** | Relative path to the file |
| **Code** | Current number of code lines |
| **Comm** | Current number of comment lines |
| **Blank** | Current number of blank lines |
| **Func** | Number of functions detected (requires `lizard`) |
| **AvgCC** | Average cyclomatic complexity per function |
| **MI** | Maintainability Index (0-100, higher is better) |
| **ΔCode** | Change in code lines (current - baseline) |
| **ΔComm** | Change in comment lines |
| **ΔBlank** | Change in blank lines |
| **ΔFunc** | Change in function count |
| **ΔAvgCC** | Change in average cyclomatic complexity |
| **ΔMI** | Change in maintainability index |
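
The Δ columns follow one rule, current minus baseline, with a missing side treated as zero (which is why an added file shows its full values as deltas and a deleted file shows negative ones). A minimal sketch, with a hypothetical function name and hypothetical metric keys:

```python
def metric_deltas(baseline, current, keys=("code", "comment", "blank", "func", "avg_cc", "mi")):
    """Return current - baseline for each metric; a missing side counts as zero."""
    baseline = baseline or {}  # None means the file did not exist in the baseline
    current = current or {}    # None means the file was deleted
    return {k: round(current.get(k, 0) - baseline.get(k, 0), 2) for k in keys}


# Values chosen to reproduce the src/module/a.py row of the example table.
before = {"code": 110, "comment": 11, "blank": 8, "func": 5, "avg_cc": 2.4, "mi": 76}
after = {"code": 120, "comment": 10, "blank": 8, "func": 5, "avg_cc": 2.3, "mi": 78}
print(metric_deltas(before, after))
```
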
**Color Coding (GUI):**
- **Green rows:** New files (Added) or improved metrics (e.g., ΔAvgCC < 0, ΔMI > 0).
- **Red rows:** Deleted files or worsened metrics (e.g., ΔAvgCC > 0, ΔMI < 0).
- **Yellow/Orange rows:** Modified files with mixed changes.
- **Gray rows:** Unmodified files (identical to baseline).
**What to look for:**
- **ΔCode >> 0:** Significant code expansion. Is it justified by new features?
- **ΔComm < 0:** Documentation decreased. Consider adding more comments.
- **ΔAvgCC > 0:** Complexity increased. May indicate need for refactoring.
- **ΔMI < 0:** Maintainability worsened. Review the changes.
- **New files with high AvgCC:** New code is already complex. Flag for review.
### 10.2 Detailed Diff Report (diff_report.txt)
A textual report is saved in the baseline folder:
**Structure:**
```
PyUCC Baseline Comparison Report
=================================
Baseline ID: MyProject__20251205T143022_local
Snapshot timestamp: 2025-12-05 14:30:22
Summary:
New files: 3
Deleted files: 1
Modified files: 12
Unchanged files: 45
Metric Changes:
Total Code Lines: +150
Total Comments: -5
Average CC: +0.2 (slight increase in complexity)
Average MI: -1.5 (slight decrease in maintainability)
[Compact UCC-style table here]
Legend:
A = Added file
D = Deleted file
M = Modified file
U = Unchanged file
...
```
### 10.3 CSV Exports
You can export any result table to CSV for further analysis in Excel, pandas, or BI tools.
**Columns include:**
- File path
- All SLOC metrics (code, comment, blank lines)
- Complexity metrics (CC, MI, function count)
- Deltas (if from a Differ operation)
- Status flags (A/D/M/U)
**Use cases:**
- Trend analysis over multiple baselines.
- Generating charts (e.g., complexity over time).
- Feeding into CI/CD quality gates.
---
## 11. Practical Use Cases and Workflows
### Use Case 1: Detecting Copy-Paste Code Before Code Review
**Scenario:** Your team is developing a new module. You suspect some developers copy-pasted existing code instead of refactoring.
**Workflow:**
1. Create a profile for your project.
2. Click **Duplicates** button.
3. Set threshold to 5% (strict).
4. Review the results table.
5. For each fuzzy duplicate pair:
   - Double-click to open both files in the diff viewer (if implemented).
   - Assess whether the duplication is intentional or should be refactored into a shared utility.
6. Export to CSV and share with the team for discussion.
**Expected outcome:** You identify 3-5 near-duplicate files and create tickets to consolidate them.
---
### Use Case 2: Tracking Complexity During a Refactoring Sprint
**Scenario:** Your team plans a 2-week refactoring sprint to reduce technical debt.
**Workflow:**
1. **Before the sprint:** Create a baseline ("Pre-Refactor").
   - Click **Differing** → Create baseline.
   - Name it "PreRefactor_Sprint5".
2. **During the sprint:** Developers refactor code, extract functions, add comments.
3. **After the sprint:** Run **Differing** against the baseline.
4. Review the compact table:
   - Check ΔAvgCC: Should be negative (complexity reduced).
   - Check ΔMI: Should be positive (maintainability improved).
   - Check ΔComm: Should be positive (more documentation).
5. Generate a diff report and attach to sprint retrospective.
**Expected outcome:** Quantitative proof that refactoring worked: "We reduced average CC by 15% and increased MI by 8 points."
---
### Use Case 3: Ensuring New Features Don't Degrade Quality
**Scenario:** You're adding a new feature to a mature codebase. You want to ensure the new code doesn't introduce excessive complexity.
**Workflow:**
1. Create a baseline before starting feature development.
2. Develop the feature in a branch.
3. Before merging to main:
   - Run **Differing** to compare current state vs. baseline.
   - Filter for new files (status = "A").
   - Check AvgCC and MI of new files.
   - If AvgCC > 5 or MI < 70, flag for refactoring before merge.
4. Use **Duplicates** to ensure new code doesn't duplicate existing utilities.
**Expected outcome:** New feature code passes quality gates before merge.
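
The gate in step 3 can be automated on an exported CSV. The sketch below is hypothetical: the function name and the column names (`file`, `status`, `avg_cc`, `mi`) are assumptions for illustration, so check the header of your actual export before reusing it.

```python
import csv
import io


def failing_new_files(rows, max_avg_cc=5.0, min_mi=70.0):
    """Return paths of added files whose AvgCC or MI breaches the limits."""
    return [
        row["file"]
        for row in rows
        if row["status"] == "A"
        and (float(row["avg_cc"]) > max_avg_cc or float(row["mi"]) < min_mi)
    ]


# In-memory CSV with illustrative column names and values.
sample = io.StringIO(
    "file,status,avg_cc,mi\n"
    "src/new_feature.py,A,1.8,82\n"
    "src/parser.py,A,7.4,64\n"
    "src/module/a.py,M,9.0,60\n"
)
print(failing_new_files(csv.DictReader(sample)))
```

Modified files are ignored on purpose here, since the workflow only gates newly added code; a stricter gate could also inspect the Δ columns of modified files.
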
---
### Use Case 4: Generating Compliance Reports for Audits
**Scenario:** Your organization requires periodic code quality audits.
**Workflow:**
1. Create baselines monthly (e.g., "Audit_2025_01", "Audit_2025_02", ...).
2. Each baseline automatically generates:
   - `countings_report.txt`
   - `metrics_report.txt`
   - `duplicates_report.txt`
3. Archive these reports in a compliance folder.
4. For the audit, provide:
   - Trend of total SLOC over time.
   - Trend of average CC and MI.
   - Number of duplicates detected and resolved each month.
**Expected outcome:** Auditors see measurable improvement in code quality metrics over time.
---
### Use Case 5: Onboarding New Developers with Code Metrics
**Scenario:** A new developer joins the team and needs to understand the codebase.
**Workflow:**
1. Run **Metrics** on the entire codebase.
2. Export to CSV.
3. Sort by AvgCC (descending) to identify the most complex modules.
4. Share the list with the new developer:
   - "These 5 files have the highest complexity. Be extra careful when modifying them."
   - "These modules have low MI. They're candidates for refactoring and make good learning exercises."
5. Use **Duplicates** to show which parts of the code have redundancy (explain why).
**Expected outcome:** New developer understands code hotspots and quality issues faster.
---
## 12. Tips for Effective Use
### 12.1 Profile Management
- **Create separate profiles** for different subprojects or components.
- Use **ignore patterns** aggressively to exclude:
  - `node_modules`, `venv`, `.venv`
  - Build outputs (`build/`, `dist/`, `bin/`)
  - Generated code
  - Test fixtures or mock data
### 12.2 Baseline Strategy
- **Naming convention:** Use descriptive names with dates or version tags:
  - `Release_v1.2.0_20251201`
  - `PreRefactor_Sprint10`
  - `BeforeMerge_FeatureX`
- **Frequency:** Create baselines at key milestones:
  - End of each sprint
  - Before/after major refactorings
  - Before releases
- **Retention:** Keep at least 3-5 recent baselines. Archive older ones.
### 12.3 Interpreting Metrics
**Cyclomatic Complexity (CC):**
- **1-5:** Simple, low risk.
- **6-10:** Moderate complexity, acceptable.
- **11-20:** High complexity, review recommended.
- **21+:** Very high complexity, refactoring strongly recommended.
**Maintainability Index (MI):**
- **85-100:** Highly maintainable (green zone).
- **70-84:** Moderately maintainable (yellow zone).
- **Below 70:** Low maintainability (red zone), needs attention.
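
If you post-process exported metrics, the bands above translate directly into two small helpers. The function names and band labels are hypothetical, and treating fractional averages (e.g., 5.5) as belonging to the next band up is an assumption of this sketch.

```python
def cc_band(cc: float) -> str:
    """Map a cyclomatic complexity value to the bands listed above."""
    if cc <= 5:
        return "simple"
    if cc <= 10:
        return "moderate"
    if cc <= 20:
        return "high"
    return "very high"


def mi_zone(mi: float) -> str:
    """Map a Maintainability Index value to the green/yellow/red zones."""
    if mi >= 85:
        return "green"
    if mi >= 70:
        return "yellow"
    return "red"


print(cc_band(2.3), mi_zone(78))
```
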
### 12.4 Duplicate Detection Best Practices
- Start with **default parameters** (k=25, window=4, threshold=5%).
- If you get too many false positives, **increase k** or **decrease threshold**.
- If you suspect duplicates are being missed, **decrease k** or **increase threshold**.
- Always **review fuzzy duplicates manually**; not all similarities are bad (e.g., interface implementations).
---
## 13. Troubleshooting and FAQs
**Q: Duplicates detection is slow on large codebases.**
**A:**
- Use profile filters to limit the file types analyzed.
- Increase `k` and `window` to reduce the number of fingerprints processed.
- Exclude large auto-generated files or test fixtures.
**Q: Why are some files missing function-level metrics?**
**A:**
- Function-level analysis requires `lizard`. Install it: `pip install lizard`.
- Some languages may not be fully supported by `lizard`.
**Q: Differ shows files as "Modified" but I didn't change them.**
**A:**
- Check if line endings changed (CRLF ↔ LF).
- Verify the file wasn't reformatted by an auto-formatter.
- PyUCC uses content hashing, so any byte-level change triggers "Modified" status.
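
To check the line-ending hypothesis quickly, you can compare the two versions after normalizing their newlines. This helper is a stand-alone sketch with a hypothetical name; it is not part of PyUCC.

```python
def differs_only_by_line_endings(old: bytes, new: bytes) -> bool:
    """True when the contents differ, but only in CRLF/CR/LF line endings."""
    if old == new:
        return False  # byte-identical: nothing differs at all

    def normalize(data: bytes) -> bytes:
        return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")

    return normalize(old) == normalize(new)


print(differs_only_by_line_endings(b"x = 1\r\ny = 2\r\n", b"x = 1\ny = 2\n"))
```

Read both files in binary mode (`open(path, "rb")`) before calling it, because text mode may translate line endings silently.
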
**Q: How do I reset all baselines?**
**A:**
- Baselines are stored in the `baseline/` folder (default).
- Delete the baseline folder or specific baseline subdirectories to reset.
**Q: Can I run PyUCC in CI/CD pipelines?**
**A:**
- Yes! Use the CLI mode:
```bash
python -m pyucc differ create /path/to/repo
python -m pyucc differ diff <baseline_id> /path/to/repo
python -m pyucc duplicates /path/to/repo --threshold 5.0
```
- Parse the JSON output or text reports in your pipeline scripts.
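
As a starting point for a pipeline script, the sketch below runs a command and parses its standard output as JSON. The helper `run_json_command` is hypothetical, and the structure of PyUCC's JSON output is not described in this manual, so inspect a real run before writing gate conditions against specific keys.

```python
import json
import subprocess
import sys


def run_json_command(command):
    """Run a command, fail on a non-zero exit code, and parse stdout as JSON."""
    completed = subprocess.run(command, capture_output=True, text=True, check=True)
    return json.loads(completed.stdout)


# Intended use in CI (path and threshold taken from the commands above):
#   report = run_json_command(
#       [sys.executable, "-m", "pyucc", "duplicates", "/path/to/repo", "--threshold", "5.0"]
#   )
# Self-contained demonstration with a stand-in command:
print(run_json_command([sys.executable, "-c", "print('{\"ok\": true}')"]))
```

With `check=True`, a failing command raises `subprocess.CalledProcessError`, which is usually enough to fail the pipeline step.
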
---