Updated manual and README

2025-12-05 11:23:44 +01:00 · 79ed9c1d72
commit 79ed9c1d72
parent b1bb6231be
3 changed files with 91 additions and 3 deletions
--- a/README.md
+++ b/README.md
@@ -55,7 +55,19 @@ python -m pyucc

 For a deep dive into usage, workflows, and configuration, please refer to the **User Manual**:

-👉 **[Read the User Manual](docs/MANUAL.md)**
+👉 **[English Manual](doc/English-manual.md)**
+
+👉 **[Manuale Italiano](doc/Italian-manual.md)**
+
+### What's New (Highlights)
+
+This release adds duplicate detection and makes baseline reports more reproducible:
+
+- **Duplicate Detection:** Finds exact and fuzzy duplicates (GUI button and CLI). Results can be exported to CSV and to a textual UCC-style report.
+- **UCC-style reports:** Compact table with additional delta columns (`ΔCode`, `ΔComm`, `ΔBlank`, `ΔFunc`, `ΔAvgCC`, `ΔMI`) for quick numeric comparison.
+- **Scanner & Baseline Improvements:** Centralized scanner, ignore-pattern normalization (e.g., `.bak` -> `*.bak`, case-insensitive), and baseline parameter storage for reproducibility.
+
+See the manuals for usage details on each feature.

 ### Technical Philosophy
 *   **Separation of Concerns:** Core logic is strictly separated from the GUI layer.
--- a/doc/English-manual.md
+++ b/doc/English-manual.md
@@ -146,4 +146,42 @@ The system uses content hashing (SHA1/MD5) to optimize calculations (caching) an

 *   **Program finds no files:** Check the Profile Manager to see if the file extension is selected in the language list or if the folder is covered by "Ignore patterns".
 *   **Extreme slowness:** If you included folders with thousands of small non-code files (e.g., `node_modules` or image assets), add them to "Ignore patterns".
-*   **Empty Diff Viewer:** Ensure the source files still exist on disk. If you deleted the project folder after creating the baseline, the viewer cannot display the current file.
+*   **Empty Diff Viewer:** Ensure the source files still exist on disk. If you deleted the project folder after creating the baseline, the viewer cannot display the current file.
+
+---
+
+## 8. New Features (Since v1.0)
+
+This release adds several capabilities that improve code-quality analysis, reproducibility of baselines, and duplicate detection across a codebase. Below is a concise description of what changed and how to use the new features.
+
+### 8.1 Duplicate Detection (GUI + CLI)
+- **What it does:** Finds exact and fuzzy duplicates across the project. Exact duplicates are detected by content hashing (SHA1). Fuzzy duplicates are detected by k-gram fingerprinting with a winnowing step; a Jaccard similarity score over the resulting fingerprints ranks likely duplicates.
+- **Parameters:** `k` (k-gram size), `window` (winnowing window), and `threshold` (percent similarity). Defaults are chosen for balanced precision/recall but can be adjusted.
+- **How to run (GUI):** Use the new **Duplicates** button in the Actions bar (it appears before the Differ button). A dialog lets you choose extensions, the similarity threshold, and fingerprinting parameters. Settings persist between runs.
+- **How to run (CLI):** `python -m pyucc duplicates <path> --threshold 5.0 --ext .py .c` prints the duplicates found as JSON.
+- **Exports:** Results can be exported to CSV and to a UCC-style textual report. When detection runs during baseline creation, the report is placed inside the baseline folder.
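
The detection steps listed in 8.1 can be sketched in a few lines of Python. This is a minimal illustration, not PyUCC's actual implementation: the function names (`exact_groups`, `fingerprints`, `jaccard_percent`), the whitespace tokenization, the use of a truncated SHA1 as the k-gram hash, and the defaults `k=5`, `window=4` are all assumptions made for the example.

```python
import hashlib
from collections import defaultdict


def exact_groups(files):
    """Group file names whose contents share the same SHA1 digest.

    `files` maps a file name to its text content (hypothetical input shape).
    """
    by_digest = defaultdict(list)
    for name, text in files.items():
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        by_digest[digest].append(name)
    return [sorted(group) for group in by_digest.values() if len(group) > 1]


def fingerprints(text, k=5, window=4):
    """Return winnowed k-gram fingerprints for `text`.

    Each k-gram of whitespace tokens is hashed; for every sliding window of
    `window` consecutive hashes, only the minimum is kept.
    """
    tokens = text.split()
    grams = [" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]
    hashes = [
        int(hashlib.sha1(g.encode("utf-8")).hexdigest()[:8], 16) for g in grams
    ]
    if not hashes:
        return set()
    if len(hashes) <= window:
        return {min(hashes)}
    return {min(hashes[i:i + window]) for i in range(len(hashes) - window + 1)}


def jaccard_percent(a, b):
    """Jaccard similarity of two fingerprint sets, as a percentage."""
    if not a and not b:
        return 0.0
    return 100.0 * len(a & b) / len(a | b)
```

The score returned by `jaccard_percent` is what a `threshold` value would be compared against. The manual does not specify the tokenization or the default parameter values, so treat those parts as placeholders.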
+
+### 8.2 UCC-style Duplicate and Differ Reports
+- **Compact UCC-style table:** Differ now produces a compact table compatible with UCC-like output, with additional Δ (delta) columns: `ΔCode`, `ΔComm`, `ΔBlank`, `ΔFunc`, `ΔAvgCC`, `ΔMI`. These show at a glance the numeric change in code lines, comment lines, blank lines, number of functions, average cyclomatic complexity, and maintainability index.
+- **Duplicates report:** A textual `duplicates_report.txt` is generated (when requested) that lists duplicate groups with pairwise percent similarity and the parameters used to generate them. Baselines store the parameters so results are reproducible.
+
+Example (compact UCC-style snippet):
+
+```
+File                    Code  Comm  Blank  Func  AvgCC   MI  ΔCode  ΔComm  ΔBlank  ΔFunc  ΔAvgCC  ΔMI
+-----------------------------------------------------------------------------------------------------
+src/module/a.py          120    10      8     5    2.3   78    +10     -1       0      0    -0.1   +2
+src/module/b_copy.py     118     8     10     5    2.4   76     -2     -2      +2      0    +0.1   -2
+```
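
The Δ columns are plain differences between the current metrics and the baseline metrics. The sketch below shows that arithmetic, assuming a hypothetical `Metrics` record and helper names (`deltas`, `signed`) that are not part of PyUCC. The baseline values are invented so that the result reproduces the `src/module/a.py` row of the snippet.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    """Per-file metrics, mirroring the columns of the compact table."""
    code: int
    comm: int
    blank: int
    func: int
    avg_cc: float
    mi: float


def deltas(baseline, current):
    """Return current minus baseline for every metric column."""
    return {
        "ΔCode": current.code - baseline.code,
        "ΔComm": current.comm - baseline.comm,
        "ΔBlank": current.blank - baseline.blank,
        "ΔFunc": current.func - baseline.func,
        # Round float differences to one decimal to hide binary noise.
        "ΔAvgCC": round(current.avg_cc - baseline.avg_cc, 1),
        "ΔMI": round(current.mi - baseline.mi, 1),
    }


def signed(value):
    """Format a delta with an explicit sign; an unchanged value prints as 0."""
    if value == 0:
        return "0"
    return f"{value:+g}"


baseline = Metrics(code=110, comm=11, blank=8, func=5, avg_cc=2.4, mi=76.0)
current = Metrics(code=120, comm=10, blank=8, func=5, avg_cc=2.3, mi=78.0)
row = [signed(v) for v in deltas(baseline, current).values()]
```

Rounding before formatting matters for the float columns: `2.3 - 2.4` is not exactly `-0.1` in binary floating point.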
+
+### 8.3 Scanner & Baseline Improvements
+- **Centralized scanning:** The `scanner` is the canonical provider of the file list. Heavy modules (Differ, Duplicates finder) accept a `file_list` produced by the scanner to avoid rescanning and to ensure consistent filtering.
+- **Ignore pattern normalization:** Ignore entries like `.bak` are normalized to `*.bak` and matching is case-insensitive by default; this prevents accidental inclusion of backup files in baselines.
+- **Baseline reproducibility:** Baselines now store the duplicate-detection parameters and a snapshot of the file list. When a baseline is re-created or analyzed later, PyUCC attempts to re-run per-file function analysis (if `lizard` is available) so that function-level metrics in older baselines remain useful.
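
The ignore-pattern normalization described in 8.3 could look roughly like this. The manual only gives the `.bak` -> `*.bak` example, so the exact rule used here (prefix `*` to entries that start with a dot and contain no wildcard or path separator) is an assumption, as are the helper names `normalize_pattern` and `is_ignored`. Matching is done on the file name only, with both sides lowercased.

```python
import fnmatch


def normalize_pattern(entry):
    """Turn a bare extension such as '.bak' into the glob '*.bak'.

    Entries that already contain a wildcard or a path separator are
    returned unchanged (assumed rule, for illustration only).
    """
    entry = entry.strip()
    has_wildcard = any(ch in entry for ch in "*?[")
    if entry.startswith(".") and not has_wildcard and "/" not in entry:
        return "*" + entry
    return entry


def is_ignored(path, patterns):
    """Return True if the file name of `path` matches any ignore pattern.

    Both the name and the pattern are lowercased, so matching is
    case-insensitive on every platform.
    """
    name = path.replace("\\", "/").rsplit("/", 1)[-1].lower()
    return any(
        fnmatch.fnmatchcase(name, normalize_pattern(p).lower())
        for p in patterns
    )
```

`fnmatch.fnmatchcase` is used instead of `fnmatch.fnmatch` so that case handling does not depend on the operating system.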
+
+### 8.4 Notes on Dependencies
+- Function-level metrics (number of functions, per-function CC) rely on `lizard`. If `lizard` is not installed, PyUCC will still produce SLOC and coarse metrics but function details may be missing. Baseline creation records this state and will re-run function analysis if `lizard` becomes available later.
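
One way to detect an optional dependency and decide whether a stored baseline needs its function analysis re-run is sketched below. The baseline metadata key `function_metrics` and both helper names are hypothetical; the manual does not describe how PyUCC records this state.

```python
import importlib.util


def lizard_available():
    """Return True if the `lizard` package can be imported."""
    return importlib.util.find_spec("lizard") is not None


def needs_function_reanalysis(baseline_meta, has_lizard):
    """Decide whether function-level metrics should be recomputed.

    `baseline_meta` is a dict with a hypothetical boolean key
    'function_metrics' that records whether function analysis ran
    when the baseline was created.
    """
    return has_lizard and not baseline_meta.get("function_metrics", False)
```

`importlib.util.find_spec` checks for the package without importing it, so the probe itself has no side effects.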
+
--- a/doc/Italian-manual.md
+++ b/doc/Italian-manual.md
@@ -146,4 +146,42 @@ Il sistema usa l'hashing (SHA1/MD5) del contenuto dei file per ottimizzare i cal

 *   **Il programma non trova file:** Controlla nel Profile Manager se l'estensione del file è nella lista dei linguaggi o se la cartella è inclusa negli "Ignore patterns".
 *   **Lentezza estrema:** Se hai incluso cartelle con migliaia di file piccoli non di codice (es. `node_modules` o cartelle di immagini), aggiungile agli "Ignore patterns".
-*   **Diff Viewer vuoto:** Assicurati che i file sorgente esistano ancora sul disco. Se hai cancellato la cartella del progetto dopo aver fatto la baseline, il viewer non potrà mostrare il file corrente.
+*   **Diff Viewer vuoto:** Assicurati che i file sorgente esistano ancora sul disco. Se hai cancellato la cartella del progetto dopo aver fatto la baseline, il viewer non potrà mostrare il file corrente.
+
+---
+
+## 8. Nuove Funzionalità (Da v1.0)
+
+Questa release introduce funzionalità che migliorano l'analisi della qualità del codice, la riproducibilità delle baseline e la ricerca di duplicazioni nel codice. Di seguito una descrizione sintetica delle novità e come usarle.
+
+### 8.1 Rilevamento Duplicati (GUI + CLI)
+- **Cosa fa:** Individua duplicati esatti e fuzzy all'interno del progetto. I duplicati esatti sono individuati tramite hashing del contenuto (SHA1). I duplicati fuzzy usano fingerprinting a k-gram con una fase di winnowing e una misura di similarità Jaccard per identificare coppie simili.
+- **Parametri:** `k` (dimensione dei k-gram), `window` (finestra di winnowing) e `threshold` (soglia di similarità in percentuale). I valori di default sono bilanciati per precisione/recall ma possono essere modificati dall'utente.
+- **Esecuzione (GUI):** Usa il nuovo pulsante **Duplicates** nella barra Azioni (posizionato prima del pulsante Differ). Una finestra di dialogo permette di scegliere estensioni, soglia e parametri di fingerprinting. Le impostazioni sono persistenti.
+- **Esecuzione (CLI):** `python -m pyucc duplicates <path> --threshold 5.0 --ext .py .c` stampa in output JSON i duplicati trovati.
+- **Esportazione:** Risultati esportabili in `CSV` e in un report testuale in stile UCC inserito nella cartella baseline (quando eseguito durante la creazione della baseline).
+
+### 8.2 Report UCC-style per Duplicati e Differenze
+- **Tabella compatta in stile UCC:** Il differ ora può generare una tabella compatta simile all'output UCC, con colonne Δ (delta) aggiuntive: `ΔCode`, `ΔComm`, `ΔBlank`, `ΔFunc`, `ΔAvgCC`, `ΔMI`, per vedere rapidamente le variazioni numeriche.
+- **Report duplicati:** Viene creato un file testuale `duplicates_report.txt` (se richiesto) che elenca i gruppi di duplicati con la similarità percentuale e i parametri usati. Le baseline salvano i parametri per garantire la riproducibilità.
+
+Esempio (snippet compatto in stile UCC):
+
+```
+File                    Code  Comm  Blank  Func  AvgCC   MI  ΔCode  ΔComm  ΔBlank  ΔFunc  ΔAvgCC  ΔMI
+-----------------------------------------------------------------------------------------------------
+src/module/a.py          120    10      8     5    2.3   78    +10     -1       0      0    -0.1   +2
+src/module/b_copy.py     118     8     10     5    2.4   76     -2     -2      +2      0    +0.1   -2
+```
+
+### 8.3 Scanner e Migliorie alle Baseline
+- **Scansione centralizzata:** Lo `scanner` è il fornitore canonico della lista file. Moduli pesanti (Differ, Duplicates) possono ricevere la `file_list` prodotta dallo scanner per evitare ricerche ripetute e garantire lo stesso filtro.
+- **Normalizzazione dei pattern di ignore:** Voci di ignore come `.bak` vengono normalizzate in `*.bak` e il matching è case-insensitive per default; questo evita di includere file di backup nelle baseline.
+- **Riproducibilità delle baseline:** Le baseline memorizzano i parametri usati per la ricerca duplicati e la lista dei file snapshot. Se in seguito viene installato `lizard`, PyUCC tenta di rieseguire l'analisi per ottenere informazioni sulle funzioni anche nelle baseline create in precedenza.
+
+### 8.4 Note sulle Dipendenze
+- Le metriche a livello di funzione (numero di funzioni, CC per funzione) richiedono `lizard`. Se `lizard` non è installato, PyUCC produrrà comunque SLOC e metriche di base, ma i dettagli per funzione potrebbero mancare. La creazione della baseline registra questo stato e tenterà una rianalisi se `lizard` diventa disponibile.
+