---
name: structured-log-analysis
description: Analyze structured text files (logs, notes, study records) for duplicates, patterns, and content comparison using awk/grep techniques.
version: 1
triggers:
  - analyzing logs or notes for duplicates
  - checking if repeated entries have identical content
  - deduplication analysis of structured text files
  - finding patterns in repetitive log files
  - comparing content blocks across records
---

# Structured Log/Note Analysis

Techniques for analyzing structured text files (logs, notes, study records) for duplicates, patterns, and content comparison.

## When to Use

- The user asks to check a log or note file for duplicates
- Cron job output may contain repetitive content
- Records in structured markdown/text files need to be compared over time

## Key Distinction: Structure vs Content Duplicates

**PITFALL**: Don't confuse structural duplicates with content duplicates!

- **Structural duplicate**: Same section header/marker appears multiple times
- **Content duplicate**: The actual text within those sections is identical

Always verify BOTH levels when the user asks about duplicates: a header that repeats does not by itself mean the text under it repeats.
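
The distinction can be checked mechanically by hashing the body under each repeated header. The following is a minimal sketch, assuming the records have already been parsed into `(header, body)` pairs; the function name `classify_duplicates` and the sample titles are hypothetical and not part of the skill.

```python
import hashlib
from collections import defaultdict


def classify_duplicates(records):
    """Label each repeated header as a content or structural-only duplicate.

    `records` is a list of (header, body) pairs; parsing them out of the
    file is left to the caller (assumption for this sketch).
    """
    digests = defaultdict(set)
    counts = defaultdict(int)
    for header, body in records:
        # Strip trailing whitespace so cosmetic differences do not count.
        normalized = "\n".join(line.rstrip() for line in body.strip().splitlines())
        digests[header].add(hashlib.sha256(normalized.encode("utf-8")).hexdigest())
        counts[header] += 1

    result = {}
    for header, n in counts.items():
        if n < 2:
            continue  # appears once: not a duplicate at either level
        result[header] = "content" if len(digests[header]) == 1 else "structural-only"
    return result


# Hypothetical sample data.
records = [
    ("Math - Limits", "Definition of a limit\n"),
    ("Math - Limits", "Definition of a limit  \n"),
    ("Math - Series", "Geometric series"),
    ("Math - Series", "Harmonic series"),
    ("Math - Integrals", "Riemann sums"),
]
print(classify_duplicates(records))
```

A header labeled `structural-only` repeats in the file while its text differs between occurrences, which is the case the pitfall above warns about.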

## Analysis Workflow

### Step 1: Count structural markers

```bash
# Count section separators (e.g., "======" lines)
grep -c "^======" file.md

# Count lines containing a specific section header
# ("章节名称" is a placeholder meaning "section name")
grep -c "章节名称" file.md
```

### Step 2: Check section distribution

```bash
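# Extract section titles (the "科目 - " prefix means "subject - ") and count occurrences;
# sed trims trailing whitespace so identical titles group together
grep -o "科目 - [^=]*" file.md | sed 's/[[:space:]]*$//' | sort | uniq -c | sort -rn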
```

### Step 3: Compare actual content (CRITICAL)

Use `awk` to extract content blocks from different occurrences and compare:

```bash
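# Print the first 20 lines of selected occurrences of one section for comparison.
# "目标章节名" is a placeholder meaning "target section name".
# N is the third occurrence to print; 5 is only an example value.
awk -v N=5 '
/^=+$/ { next }                       # skip separator lines
/目标章节名/ {
    matches++
    if (matches == 1 || matches == 2 || matches == N) {
        print "===== Record #" matches " ====="
        capture = 1
        lines = 0
    } else {
        capture = 0                   # an unselected occurrence ends the previous capture
    }
}
capture && lines < 20 { print; lines++ }
capture && lines >= 20 { capture = 0 }
' file.md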
```

### Step 4: Extract unique content patterns

```bash
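# Count distinct opening lines of the first code block under each "题目1" ("Question 1") heading
awk '
/## 📝 题目1/ { in_q = 1; in_code = 0; next }
in_q && /^```/ { in_code = !in_code; next }   # fence lines, with or without a language tag
in_q && in_code {
    q = substr($0, 1, 100)                    # first 100 characters identify the question
    if (length(q) > 20) {                     # skip short lines such as blanks or lone braces
        questions[q]++
        in_q = 0; in_code = 0
    }
}
END { for (q in questions) print questions[q], q }
' file.md | sort -rn | head -10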
```

## Reporting Results

When reporting duplicates, always specify:
1. Total record count
2. Number of unique sections/chapters
3. Whether content within repeated sections is identical or varies
4. Example comparison showing the difference (or sameness)
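
The four items above can be assembled from parsed records. This is a rough sketch, assuming records are available as `(section, body)` pairs in file order; `build_report` and the diff labels are illustrative names, and the example comparison uses Python's standard `difflib`.

```python
import difflib
from collections import defaultdict


def build_report(records):
    """Return the report as a list of text lines.

    `records` is a list of (section, body) pairs in file order
    (an assumption of this sketch).
    """
    by_section = defaultdict(list)
    for section, body in records:
        by_section[section].append(body.strip())

    lines = [
        f"Total records: {len(records)}",
        f"Unique sections: {len(by_section)}",
    ]
    for section, bodies in by_section.items():
        if len(bodies) < 2:
            continue  # not repeated, nothing to compare
        if len(set(bodies)) == 1:
            lines.append(f"{section}: {len(bodies)} occurrences, content identical")
            continue
        lines.append(f"{section}: {len(bodies)} occurrences, content varies")
        # Example comparison: first occurrence vs. the first one that differs.
        other = next(b for b in bodies if b != bodies[0])
        diff = difflib.unified_diff(
            bodies[0].splitlines(), other.splitlines(),
            fromfile="occurrence 1", tofile="differing occurrence",
            lineterm="", n=0,
        )
        lines.extend("    " + d for d in diff)
    return lines


# Hypothetical sample data.
sample = [("A", "x"), ("A", "x"), ("B", "p\nq"), ("B", "p\nr"), ("C", "z")]
print("\n".join(build_report(sample)))
```

When every repeated section is identical, no diff is emitted, so the report states the sameness rather than showing an empty comparison.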

## Deduplication Script

For files with timestamped section headers, use the bundled script:

```bash
python ~/.hermes/skills/data-science/structured-log-analysis/scripts/dedup_by_section.py \
    input.md output.md
```

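The script keeps only the LAST occurrence of each section, on the assumption that the latest entry is the most recent and most complete.
It works with headers like `【2026-05-18 00:45】科目 - 章节名` (a bracketed timestamp followed by "subject - chapter name").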

To match a different header format, pass a custom header pattern:
```bash
python dedup_by_section.py input.md output.md \
    --header-pattern "^## Section: (.+)$"
```
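
For orientation, the keep-last behavior described above could look roughly like the sketch below. It is not the source of the bundled `dedup_by_section.py`; the regex, the function name `keep_last_occurrence`, and the choice to treat capture group 1 as the section title are assumptions made for illustration.

```python
import re

# Assumed header shape: 【YYYY-MM-DD HH:MM】 followed by the section title.
HEADER_RE = re.compile(r"^【\d{4}-\d{2}-\d{2} \d{2}:\d{2}】(.+)$")


def keep_last_occurrence(text, header_re=HEADER_RE):
    """Return `text` with only the last block kept for each section title."""
    preamble, blocks = [], []  # blocks: list of (title, [lines])
    for line in text.splitlines():
        m = header_re.match(line)
        if m:
            blocks.append((m.group(1).strip(), [line]))
        elif blocks:
            blocks[-1][1].append(line)
        else:
            preamble.append(line)  # text before the first header is kept as-is

    # Later blocks overwrite earlier ones, leaving the index of the last occurrence.
    last_index = {title: i for i, (title, _) in enumerate(blocks)}
    kept = [lines for i, (title, lines) in enumerate(blocks) if last_index[title] == i]
    return "\n".join(preamble + [l for lines in kept for l in lines])


# Hypothetical sample input.
sample = (
    "【2026-05-18 00:45】Math - Limits\nold notes\n"
    "【2026-05-18 01:45】Math - Series\nseries notes\n"
    "【2026-05-18 02:45】Math - Limits\nnew notes\n"
)
print(keep_last_occurrence(sample))
```

Surviving blocks stay in the order of their last occurrence, so the output remains chronological. Diff the output against the input before replacing the original file.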

## Common Patterns

| File Type | Separator | Content Marker |
|-----------|-----------|----------------|
| Study notes | `=====` | `## 📖 知识点` (knowledge points), `## 📝 题目` (questions) |
| Error logs | Timestamp | Stack trace start |
| Cron outputs | Date header | Task name |

## Linked Files

- `scripts/dedup_by_section.py` — Python deduplication script for timestamped section headers