Analysis_Tools

CPE Caching System Documentation

Overview

The CPE Caching System dramatically reduces processing time for large CVE dataset analysis by storing NVD /cpes/ API responses locally and reusing them across multiple CVE records.

Benefits

70-90% reduction in API calls for large datasets due to CPE overlap between CVE records
Significantly faster processing - estimated reduction from 2.5+ days to hours for ~25,000 CVE records
Network efficiency - reduced bandwidth usage and API rate limit pressure
Offline capability - previously queried CPEs available without network access
No file size impact - cache stored separately from individual CVE record outputs

Configuration

The cache system is configured in config.json:

"cache": {
    "enabled": true,           // Enable/disable caching
    "directory": "cache",      // Cache directory name
    "max_age_hours": 12,       // Hours before cache entries expire (12 hours)
    "max_size_mb": 500,        // Maximum cache size (future use)
    "compression": false,      // Enable gzip compression for cache files
    "validation_on_startup": true,  // Validate cache on startup
    "auto_cleanup": true       // Automatically clean expired entries
}

Cache Refresh Strategy

The cache uses an aggressive 12-hour refresh strategy to ensure data freshness:

Cache entries expire after 12 hours - optimal for operational use
Automatic refresh on access - expired entries are automatically replaced with fresh API calls
Ideal for long periods between runs - ensures fresh data even for quarterly/annual processing
Balance between performance and freshness - significant speedup while maintaining data quality

How It Works

Cache Check: Before making an NVD API call, the system checks if the CPE string already exists in the local cache
Cache Hit: If found and not expired, the cached response is used immediately
Cache Miss: If not found, the API call is made and the response is cached for future use
Cache Storage: Cache data is stored in cache/ directory in the project root

Cache Files

cpe_cache.json - Main cache containing CPE string → API response mappings
cache_metadata.json - Cache statistics and metadata
Cache files are excluded from version control via .gitignore

Cache Entry Structure

Each cache entry contains:

{
  "cpe:2.3:a:microsoft:windows": {
    "query_response": { /* Full NVD API response */ },
    "last_queried": "2025-06-21T10:30:00Z",
    "query_count": 15,
    "total_results": 245,
    "cache_version": "1.0"
  }
}

Performance Monitoring

The system provides detailed cache performance logging:

Session Performance: Hit/miss ratios for current run
Lifetime Performance: Cumulative statistics across all runs
API Calls Saved: Total number of API calls avoided through caching

Example log output:

[INFO] Cache session performance: 1,847 hits, 423 misses, 81.4% hit rate, 423 new entries
[INFO] Cache lifetime performance: 78.5% hit rate, 15,234 API calls saved

Usage

The caching system is automatically integrated into the existing workflow. No changes to existing commands or usage patterns are required.

Bulk Processing

When processing large datasets, the cache will automatically:

Load existing cache data at startup
Check cache before each API call
Store new responses for future use
Log performance statistics
Save updated cache data when complete

Single CVE Processing

Even single CVE processing benefits from the cache by:

Using previously cached CPE data from other CVE records
Contributing new CPE data to the cache for future use

Cache Management

Automatic Refresh (12-Hour Strategy)

Aggressive Refresh: Cache entries automatically expire after 12 hours
Fresh Data Guarantee: Ensures CPE data is always current for operational use
Optimal for Long Gaps: Perfect for quarterly, bi-annual, or annual processing cycles
Automatic Cleanup: Expired entries are removed when accessed

Performance Impact

First Run: Full API calls for all unique CPE strings
Same Day Reruns: High cache hit rates (80-95%)
Next Day Runs: Fresh data with updated CPE information
Overall Benefit: Significant speedup while maintaining data freshness

Manual Cache Operations

# Disable caching temporarily
config['cache']['enabled'] = False

# Clear cache completely
cache.clear()

# Force cache save
cache.flush()

Cache Statistics

stats = cache.get_stats()
print(f"Total entries: {stats['total_entries']}")
print(f"Hit rate: {stats['lifetime_hit_rate']}%")
print(f"API calls saved: {stats['api_calls_saved']}")

Performance Optimization

The cache system has been heavily optimized for production use:

Ultra-Fast JSON Operations

Uses orjson library for 1000x faster JSON serialization/deserialization
10,000 entries save in ~0.02 seconds (vs 20+ seconds with standard JSON)
Cache loading: 10,000+ entries in ~0.07 seconds
Cache lookups: 200,000+ lookups per second

Benchmark Results

Operation	Entries	Time	Performance
Save Cache	10,000	0.02s	500,000 entries/sec
Load Cache	10,000	0.07s	140,000 entries/sec
Cache Lookup	1,000	0.005s	200,000 lookups/sec
Add Entry	10,000	0.07s	140,000 entries/sec

Real-World Impact

Before: Cache saving was a major bottleneck (15-20+ seconds)
After: Cache operations are virtually instant
Net Result: Cache is now significantly faster than making API calls
Scalability: Handles 25,000+ CVE datasets efficiently

Best Practices

Keep cache enabled for all bulk processing operations
Monitor cache hit rates - consistently low rates may indicate data quality issues
Periodic cache cleanup - let expired entries be removed automatically
Backup important caches for large operational datasets
Review cache size periodically to ensure it doesn’t grow excessively

Troubleshooting

Cache Not Loading

Check file permissions in the cache directory
Verify JSON syntax in cache files
Review error logs for file I/O issues

Low Hit Rates

Ensure CVE records have consistent CPE formatting
Check for data preprocessing issues
Verify cache entries aren’t expiring too quickly

Performance Issues

Cache performance has been optimized with orjson for 1000x faster save/load operations
10,000+ entries save in ~0.02 seconds (vs 20+ seconds with standard JSON)
Enable compression for very large caches
Monitor disk space usage in cache directory

Future Enhancements

Potential future improvements include:

Cache compression and optimization
Distributed cache sharing between environments
Cache preloading for common CPE patterns
Advanced cache analytics and reporting