SkySignal Agent
Official APM agent for monitoring Meteor.js applications with SkySignal.
Features
- System Metrics Monitoring - CPU, memory, disk, network, V8 heap, event loop utilization, and process resource usage
- Method Performance Traces - Track Meteor Method execution with operation-level profiling and COLLSCAN detection
- Publication Monitoring - Monitor publication performance and subscriptions
- Publication Efficiency Analysis - Detect over-fetching (missing field projections) and unbounded cursors
- Error Tracking - Automatic server-side and client-side error capture with browser context
- Log Collection - Capture `console.*` and Meteor `Log.*` output with structured metadata and sampling
- HTTP Request Monitoring - Track outgoing HTTP requests
- Outbound HTTP Instrumentation - Zero-patch `diagnostics_channel` tracing for outbound HTTP/HTTPS and `fetch` requests
- Database Query Monitoring - MongoDB query performance tracking with COLLSCAN flagging
- Live Query Monitoring - Per-observer driver detection for Change Streams (Meteor 3.5+), oplog, and polling
- DNS Timing - Measure DNS resolution latency by wrapping `dns.lookup` and `dns.resolve`
- CPU Profiling - On-demand inspector-based CPU profiling when CPU exceeds a configurable threshold
- Deprecated API Detection - Track sync vs async Meteor API usage to guide Meteor 3.x migration
- Environment Snapshots - Periodic capture of package versions, Node.js flags, and OS metadata
- Vulnerability Scanning - Hourly `npm audit` with severity reporting for high/critical CVEs
- Real User Monitoring (RUM) - Browser-side Core Web Vitals (LCP, FID, CLS, TTFB, FCP, TTI) with automatic performance warnings
- SPA Route Tracking - Automatic performance collection on every route change
- Session Tracking - 30-minute user sessions with localStorage persistence
- Browser Context - Automatic device, browser, OS, and network information collection
- Batch Processing - Efficient batching and async delivery to minimize performance impact
- Worker Thread Offloading - Optional `worker_threads` pool for compression to keep the host event loop clear
Installation
Add the package to your Meteor application:
meteor add skysignal:agent
Quick Start
1. Get Your API Key
Sign up at SkySignal and create a new site to get your API key.
2. Configure the Agent
In your Meteor server startup code (e.g., server/main.js):
```javascript
import { Meteor } from 'meteor/meteor';
import { SkySignalAgent } from 'meteor/skysignal:agent';

Meteor.startup(() => {
  // Configure the agent
  SkySignalAgent.configure({
    apiKey: process.env.SKYSIGNAL_API_KEY || 'your-api-key-here',
    enabled: true,
    host: 'my-app-server-1', // Optional: defaults to hostname
    appVersion: '1.2.3', // Optional: auto-detected from package.json

    // Optional: Customize collection intervals
    systemMetricsInterval: 60000, // 1 minute (default)
    flushInterval: 10000, // 10 seconds (default)
    batchSize: 50, // Max items per batch (default)

    // Optional: Sampling for high-traffic apps
    traceSampleRate: 1.0, // 100% of traces (reduce for high volume)

    // Optional: Feature toggles
    collectTraces: true,
    collectMongoPool: true,
    collectDDPConnections: true,
    collectJobs: true
  });

  // Start monitoring
  SkySignalAgent.start();
});
```
3. Add to Settings File
For production, use Meteor settings. The agent auto-initializes from settings if configured:
settings-production.json:
```json
{
  "skysignal": {
    "apiKey": "sk_your_api_key_here",
    "enabled": true,
    "host": "production-server-1",
    "appVersion": "1.2.3",
    "traceSampleRate": 0.5,
    "collectTraces": true,
    "collectMongoPool": true,
    "collectDDPConnections": true,
    "collectJobs": true,
    "collectLogs": true,
    "logLevels": ["warn", "error", "fatal"],
    "logSampleRate": 0.5,
    "captureIndexUsage": true,
    "indexUsageSampleRate": 0.05,
    "collectDnsTimings": true,
    "collectOutboundHttp": true,
    "collectCpuProfiles": true,
    "cpuProfileThreshold": 80,
    "collectDeprecatedApis": true,
    "collectPublications": true,
    "collectEnvironment": true,
    "collectVulnerabilities": true
  },
  "public": {
    "skysignal": {
      "publicKey": "pk_your_public_key_here",
      "rum": {
        "enabled": true,
        "sampleRate": 0.5
      },
      "errorTracking": {
        "enabled": true,
        "captureUnhandledRejections": true
      }
    }
  }
}
```
The agent auto-starts when it finds valid configuration in Meteor.settings.skysignal.
Manual initialization (optional):
```javascript
import { Meteor } from 'meteor/meteor';
import { SkySignalAgent } from 'meteor/skysignal:agent';

Meteor.startup(() => {
  // Only needed if not using settings auto-initialization
  const config = Meteor.settings.skysignal;

  if (config && config.apiKey) {
    SkySignalAgent.configure(config);
    SkySignalAgent.start();
  } else {
    console.warn('⚠️ SkySignal not configured - monitoring disabled');
  }
});
```
Configuration Options
API Configuration
| Option | Type | Default | Description |
|---|---|---|---|
apiKey | String | required | Your SkySignal API key (sk_ prefix) |
endpoint | String | https://dash.skysignal.app | SkySignal API endpoint |
enabled | Boolean | true | Enable/disable the agent |
Host & Version Identification
| Option | Type | Default | Description |
|---|---|---|---|
host | String | os.hostname() | Host identifier for this instance |
appVersion | String | Auto-detect | App version from package.json or manually configured |
buildHash | String | Auto-detect | Build hash for source map lookup. Auto-detects from BUILD_HASH or GIT_SHA environment variables |
Batching Configuration
| Option | Type | Default | Description |
|---|---|---|---|
batchSize | Number | 50 | Max items per batch before auto-flush |
batchSizeBytes | Number | 262144 | Max bytes (256KB) per batch |
flushInterval | Number | 10000 | Interval (ms) to flush batched data |
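How the two size limits might interact can be sketched as follows. This is a minimal illustration and not the agent's implementation: `BatchBuffer`, `add`, `flush`, and `onFlush` are hypothetical names, and only the default values (50 items, 262144 bytes) come from the table above.

```javascript
// Hypothetical sketch: flush when either the item count or the byte budget is reached.
class BatchBuffer {
  constructor({ batchSize = 50, batchSizeBytes = 262144, onFlush }) {
    this.batchSize = batchSize;
    this.batchSizeBytes = batchSizeBytes;
    this.onFlush = onFlush;
    this.items = [];
    this.bytes = 0;
  }

  add(item) {
    // Approximate the payload size by the byte length of the serialized JSON.
    const size = Buffer.byteLength(JSON.stringify(item));
    // Flush first if this item would push the batch over the byte budget.
    if (this.items.length > 0 && this.bytes + size > this.batchSizeBytes) {
      this.flush();
    }
    this.items.push(item);
    this.bytes += size;
    // Flush as soon as the item-count limit is reached.
    if (this.items.length >= this.batchSize) {
      this.flush();
    }
  }

  flush() {
    if (this.items.length === 0) return;
    const batch = this.items;
    this.items = [];
    this.bytes = 0;
    this.onFlush(batch);
  }
}
```

In a setup like this, a timer that calls `flush()` every `flushInterval` milliseconds would cover quiet periods in which neither limit is hit.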
Sampling Rates
| Option | Type | Default | Description |
|---|---|---|---|
traceSampleRate | Number | 1.0 | Server trace sample rate (0-1). Set to 0.1 for 10% |
rumSampleRate | Number | 0.5 | RUM sample rate (0-1). 50% by default for high-volume |
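A sample rate is typically applied as a per-event probability. The helper below is a sketch of that idea only; `shouldSample` and its injectable `random` parameter are hypothetical and not part of the agent's API.

```javascript
// Hypothetical helper: keep an event with probability `rate`
// (0 drops everything, 1 keeps everything).
function shouldSample(rate, random = Math.random) {
  if (rate >= 1) return true;
  if (rate <= 0) return false;
  return random() < rate;
}

// With traceSampleRate: 0.1, roughly one trace in ten would be kept.
const keepTrace = shouldSample(0.1);
```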
Collection Intervals
| Option | Type | Default | Description |
|---|---|---|---|
systemMetricsInterval | Number | 60000 | System metrics collection interval (1 minute) |
mongoPoolInterval | Number | 60000 | MongoDB pool metrics interval (1 minute) |
collectionStatsInterval | Number | 300000 | Collection stats interval (5 minutes) |
ddpConnectionsInterval | Number | 30000 | DDP connection updates interval (30 seconds) |
jobsInterval | Number | 30000 | Background job stats interval (30 seconds) |
dnsTimingsInterval | Number | 60000 | DNS timing aggregation interval (1 minute) |
outboundHttpInterval | Number | 60000 | Outbound HTTP aggregation interval (1 minute) |
cpuProfileCheckInterval | Number | 30000 | CPU check interval for threshold profiling (30 seconds) |
deprecatedApisInterval | Number | 300000 | Deprecated API usage reporting interval (5 minutes) |
publicationsInterval | Number | 300000 | Publication efficiency reporting interval (5 minutes) |
environmentInterval | Number | 1800000 | Environment snapshot interval (30 minutes) |
vulnerabilitiesInterval | Number | 3600000 | Vulnerability scan interval (1 hour) |
Feature Flags
| Option | Type | Default | Description |
|---|---|---|---|
collectSystemMetrics | Boolean | true | Collect system metrics (CPU, memory, disk, network) |
collectTraces | Boolean | true | Collect method/publication traces |
collectErrors | Boolean | true | Collect errors and exceptions |
collectHttpRequests | Boolean | true | Collect HTTP request metrics |
collectMongoPool | Boolean | true | Collect MongoDB connection pool metrics |
collectCollectionStats | Boolean | true | Collect MongoDB collection statistics |
collectDDPConnections | Boolean | true | Collect DDP/WebSocket connection metrics |
collectLiveQueries | Boolean | true | Collect Meteor live query metrics (change streams, oplog, polling) |
collectJobs | Boolean | true | Collect background job metrics |
collectLogs | Boolean | true | Collect server-side logs from console and Meteor Log |
collectRUM | Boolean | false | Client-side RUM (disabled by default, requires publicKey) |
collectDnsTimings | Boolean | true | Collect DNS resolution latency by wrapping dns.lookup/dns.resolve |
collectOutboundHttp | Boolean | true | Collect outbound HTTP metrics via diagnostics_channel (Node 16+) |
collectCpuProfiles | Boolean | true | Enable on-demand CPU profiling when CPU exceeds threshold |
collectDeprecatedApis | Boolean | true | Track sync vs async Meteor API usage (migration readiness) |
collectPublications | Boolean | true | Detect publication over-fetching and missing projections |
collectEnvironment | Boolean | true | Capture environment metadata (packages, flags, OS info) |
collectVulnerabilities | Boolean | true | Run npm audit scans and report high/critical CVEs |
MongoDB Pool Configuration
| Option | Type | Default | Description |
|---|---|---|---|
mongoPoolFixedConnectionMemory | Number | null | Optional: fixed bytes per connection for memory estimation |
Method Tracing Configuration
| Option | Type | Default | Description |
|---|---|---|---|
traceMethodArguments | Boolean | true | Capture method arguments (sanitized) |
maxArgLength | Number | 1000 | Max string length for arguments |
traceMethodOperations | Boolean | true | Capture detailed operation timeline |
Index Usage Tracking
| Option | Type | Default | Description |
|---|---|---|---|
captureIndexUsage | Boolean | true | Capture MongoDB index usage via explain() |
indexUsageSampleRate | Number | 0.05 | Sample 5% of queries for explain() |
explainVerbosity | String | executionStats | One of queryPlanner, executionStats, or allPlansExecution |
explainSlowQueriesOnly | Boolean | false | Only explain queries exceeding slow threshold |
Performance Safeguards
| Option | Type | Default | Description |
|---|---|---|---|
maxBatchRetries | Number | 3 | Max retries for failed batches |
requestTimeout | Number | 3000 | API request timeout (3 seconds) |
maxMemoryMB | Number | 50 | Max memory (MB) for batches |
CPU Profiling Configuration
| Option | Type | Default | Description |
|---|---|---|---|
cpuProfileThreshold | Number | 80 | CPU usage percentage to trigger an on-demand profile |
cpuProfileDuration | Number | 10000 | Duration (ms) of the CPU profile sample |
cpuProfileCooldown | Number | 300000 | Minimum time (ms) between consecutive profiles (5 minutes) |
Worker Offload (Large Pools)
| Option | Type | Default | Description |
|---|---|---|---|
useWorkerThread | Boolean | false | Enable worker thread for large pools |
workerThreshold | Number | 50 | Spawn worker if pool size exceeds this |
Background Job Monitoring
| Option | Type | Default | Description |
|---|---|---|---|
collectJobs | Boolean | true | Enable background job monitoring |
jobsInterval | Number | 30000 | Job stats collection interval (30 seconds) |
jobsPackage | String | null | Auto-detect, or specify: "msavin:sjobs" |
Log Collection
| Option | Type | Default | Description |
|---|---|---|---|
collectLogs | Boolean | true | Enable log capturing |
logLevels | Array | ["info", "warn", "error", "fatal"] | Log levels to capture (excludes debug by default) |
logSampleRate | Number | 1.0 | Sample rate (0-1). Reduce for high-volume apps |
logMaxMessageLength | Number | 10000 | Max characters per log message before truncation |
logCaptureConsole | Boolean | true | Intercept console.log, console.info, console.warn, console.error, console.debug |
logCaptureMeteorLog | Boolean | true | Intercept Meteor Log.info, Log.warn, Log.error, Log.debug |
Client-Side Error Tracking
Client-side error tracking is configured in Meteor.settings.public.skysignal.errorTracking and auto-initializes alongside RUM.
| Option | Type | Default | Description |
|---|---|---|---|
errorTracking.enabled | Boolean | true | Enable client-side error capture |
errorTracking.captureUnhandledRejections | Boolean | true | Capture unhandled Promise rejections |
errorTracking.debug | Boolean | false | Log error tracker activity to the browser console |
What Gets Monitored
System Metrics (Automatic)
The agent automatically collects:
- CPU Usage - Overall CPU utilization percentage
- CPU Cores - Number of CPU cores available
- Load Average - 1m, 5m, 15m load averages
- Memory Usage - Total, used, free, and percentage (heap, external, RSS)
- Event Loop Utilization - 0-1 ratio of how busy the event loop is (Node 14.10+)
- V8 Heap Statistics - Per-space breakdown (new_space, old_space, code_space, etc.), native context count, detached context leak detection
- Process Resource Usage - User/system CPU time, voluntary/involuntary context switches, filesystem reads/writes (via `process.resourceUsage()`)
- Active Resources - Handle/request counts by type (Timer, TCPWrap, FSReqCallback) for resource leak detection (Node 17+)
- Container Memory Limit - cgroup memory constraint for containerized deployments (Node 19+)
- Disk Usage - Disk space utilization (platform-dependent)
- Network Traffic - Bytes in/out (platform-dependent)
- Process Count - Number of running processes (platform-dependent)
- Agent Version - Tracks the installed agent version for compatibility checks
Collected every 60 seconds by default.
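Several of these values are available from Node.js built-ins. The sketch below shows one way to read them; `collectSnapshot` and the shape of its return value are hypothetical and do not describe the agent's actual payload.

```javascript
const os = require('node:os');
const v8 = require('node:v8');
const { performance } = require('node:perf_hooks');

// Hypothetical helper: gather a few of the listed metrics from Node.js built-ins.
function collectSnapshot() {
  const mem = process.memoryUsage();
  const usage = process.resourceUsage();
  return {
    cpuCores: os.cpus().length,
    loadAverage: os.loadavg(), // [1m, 5m, 15m]
    memory: {
      total: os.totalmem(),
      free: os.freemem(),
      rss: mem.rss,
      heapUsed: mem.heapUsed,
      external: mem.external,
    },
    // Cumulative 0-1 ratio since the event loop started.
    eventLoopUtilization: performance.eventLoopUtilization().utilization,
    heapSpaces: v8.getHeapSpaceStatistics().map((s) => ({
      name: s.space_name,
      usedBytes: s.space_used_size,
    })),
    contextSwitches: {
      voluntary: usage.voluntaryContextSwitches,
      involuntary: usage.involuntaryContextSwitches,
    },
    // Newer APIs are feature-detected so the sketch also runs on older Node versions.
    activeResources: typeof process.getActiveResourcesInfo === 'function'
      ? process.getActiveResourcesInfo().length
      : null,
    containerMemoryLimit: typeof process.constrainedMemory === 'function'
      ? process.constrainedMemory()
      : null,
  };
}
```

Disk, network, and process-count figures are not covered here because they depend on platform-specific sources.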
Method Traces
Automatic instrumentation of Meteor Methods:
- Method name and execution time
- Operation-level breakdown (DB queries, async operations, compute time)
- Detailed MongoDB operation tracking with explain() support
- COLLSCAN detection - Flags queries performing full collection scans (no index used)
- Slow aggregation pipeline capture - Captures sanitized pipeline stages for slow aggregations
- N+1 query detection and slow query analysis
- `this.unblock()` analysis with optimization recommendations
- Wait time tracking (DDP queue, connection pool)
- Error tracking with stack traces
- User context and session correlation
Publication Monitoring
Track publication performance:
- Publication name and execution time
- Subscription lifecycle tracking
- Document counts (added, changed, removed)
- Data transfer size estimation
- Live query efficiency (oplog vs polling)
DDP Connection Monitoring
Real-time WebSocket connection tracking:
- Active connection count and status
- Message volume (sent/received) by type
- Bandwidth usage per connection
- Latency measurements (ping/pong)
- Subscription tracking per connection
MongoDB Pool Monitoring
Connection pool health and performance:
- Pool configuration (min/max size, timeouts)
- Active vs available connections
- Checkout wait times (avg, max, P95)
- Queue length and timeout tracking
- Memory usage estimation
Live Query Monitoring
Meteor reactive query tracking with per-observer driver detection:
- Change Stream detection (Meteor 3.5+), oplog, and polling observer types
- Per-observer introspection via `handle._multiplexer._observeDriver.constructor.name`
- Fallback to `MONGO_OPLOG_URL` heuristic for pre-3.5 Meteor apps
- Reactive efficiency metric: `(changeStream + oplog) / total observers`
- Observer count by collection
- Document update rates
- Performance ratings (optimal/good/slow)
- Query signature deduplication
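The reactive efficiency metric listed above is a simple ratio. A minimal sketch, assuming the total is the sum of the three observer types; `reactiveEfficiency` and the `counts` shape are hypothetical names:

```javascript
// Hypothetical helper: (changeStream + oplog) / total observers.
// Returns null when there are no observers, to avoid dividing by zero.
function reactiveEfficiency(counts) {
  const { changeStream = 0, oplog = 0, polling = 0 } = counts;
  const total = changeStream + oplog + polling;
  return total === 0 ? null : (changeStream + oplog) / total;
}

// 2 change stream + 6 oplog observers out of 10 total.
const efficiency = reactiveEfficiency({ changeStream: 2, oplog: 6, polling: 2 });
```

A value close to 1 would mean most observers avoid polling.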
Background Job Monitoring
Track msavin:sjobs (Steve Jobs) and other job packages:
- Job execution times and status
- Queue length and worker utilization
- Failed job tracking with error details
- Job type categorization
DNS Timing
Measure DNS resolution latency to detect slow or misconfigured resolvers:
- Wraps `dns.lookup()` and `dns.resolve()` without replacing them
- Per-hostname resolution times with avg, P95, and max latency
- Failure counts and error tracking
- Ring buffer (last 500 samples) to bound memory
- Particularly useful in Docker/K8s environments where DNS is a common latency source
Reported every 60 seconds by default.
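The bounded sample buffer and the latency summary can be sketched like this. `LatencyRing` and `timeLookup` are hypothetical names, the P95 uses the nearest-rank method as an assumption, and per-hostname bookkeeping is left out for brevity.

```javascript
// Hypothetical ring buffer: keeps only the most recent `capacity` samples.
class LatencyRing {
  constructor(capacity = 500) {
    this.capacity = capacity;
    this.samples = [];
    this.next = 0; // index that the next sample overwrites once the ring is full
  }

  record(ms) {
    if (this.samples.length < this.capacity) this.samples.push(ms);
    else this.samples[this.next] = ms;
    this.next = (this.next + 1) % this.capacity;
  }

  summary() {
    if (this.samples.length === 0) return null;
    const sorted = [...this.samples].sort((a, b) => a - b);
    const sum = sorted.reduce((acc, v) => acc + v, 0);
    return {
      count: sorted.length,
      avg: sum / sorted.length,
      p95: sorted[Math.ceil(0.95 * sorted.length) - 1], // nearest-rank percentile
      max: sorted[sorted.length - 1],
    };
  }
}

// Wrap a callback-style function shaped like dns.lookup(hostname[, options], callback)
// and record how long each call takes.
function timeLookup(lookup, ring) {
  return function timedLookup(hostname, ...rest) {
    const callback = rest.pop();
    const start = process.hrtime.bigint();
    return lookup(hostname, ...rest, (err, ...results) => {
      ring.record(Number(process.hrtime.bigint() - start) / 1e6); // ns to ms
      callback(err, ...results);
    });
  };
}
```

For example, `timeLookup(require('node:dns').lookup, new LatencyRing())` would return a timed variant that leaves the original function untouched.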
Outbound HTTP Instrumentation
Track outbound HTTP/HTTPS requests using Node.js diagnostics_channel (Node 16+):
- Zero monkey-patching — uses the same mechanism as OpenTelemetry and Undici
- Request timing breakdown: DNS, connect, TLS handshake, TTFB, total duration
- Request/response metadata: method, host, path, status code, content-length
- Error rates for external API dependencies
- Aggregated per endpoint to minimize cardinality
Reported every 60 seconds by default.
CPU Profiling (On-Demand)
Automatic CPU profiling when CPU usage spikes above a configurable threshold:
- Uses the built-in `inspector` module (same as Chrome DevTools), with zero dependencies
- Triggered automatically when CPU exceeds the threshold (default: 80%)
- Sends a summary (top functions by self-time), not raw profile data
- Configurable duration (default: 10s) and cooldown (default: 5 min between profiles)
- Minimal overhead when not actively profiling
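The threshold and cooldown rules can be expressed as a small gate in front of the profiler. This is a sketch under stated assumptions: `createProfileTrigger` and `shouldProfile` are hypothetical names, and only the defaults (80%, 5 minutes) come from this section.

```javascript
// Hypothetical gate: allow a profile only when CPU exceeds the threshold
// and the cooldown since the previous profile has elapsed.
function createProfileTrigger({ threshold = 80, cooldown = 300000 } = {}) {
  let lastProfileAt = -Infinity;
  return function shouldProfile(cpuPercent, now = Date.now()) {
    if (cpuPercent <= threshold) return false; // must exceed, not just reach, the threshold
    if (now - lastProfileAt < cooldown) return false; // still cooling down
    lastProfileAt = now;
    return true;
  };
}

const shouldProfile = createProfileTrigger();
```

Calling `shouldProfile(cpuPercent)` on each check interval would then decide whether to start a capture, for example through an `inspector.Session` using the `Profiler.start` and `Profiler.stop` commands.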
Deprecated API Detection
Track synchronous vs asynchronous Meteor API usage to measure migration readiness:
- Wraps `Mongo.Collection` prototype methods to count sync vs async calls
- Tracks `Collection.find().fetch()` vs `fetchAsync()`, `findOne()` vs `findOneAsync()`, etc.
- Tracks `Meteor.call()` vs `Meteor.callAsync()`
- Per-collection counters with negligible overhead (just increments)
- Helps prioritize Meteor 3.x async migration efforts
Reported every 5 minutes by default.
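Counting calls by wrapping prototype methods can be sketched as follows. `countCalls` is a hypothetical helper, and `FakeCollection` is a stand-in used so the example runs outside Meteor; the list above names `Mongo.Collection` as the real target.

```javascript
// Hypothetical helper: replace each named prototype method with a wrapper
// that increments a counter and then delegates to the original.
function countCalls(proto, methodNames, counters) {
  for (const name of methodNames) {
    const original = proto[name];
    if (typeof original !== 'function') continue;
    counters[name] = 0;
    proto[name] = function (...args) {
      counters[name] += 1;
      return original.apply(this, args);
    };
  }
}

// Stand-in class for demonstration only.
class FakeCollection {
  findOne() { return { _id: 'a' }; }
  async findOneAsync() { return { _id: 'a' }; }
}

const counters = {};
countCalls(FakeCollection.prototype, ['findOne', 'findOneAsync'], counters);

const demo = new FakeCollection();
demo.findOne();
demo.findOne();
demo.findOneAsync();
// counters is now { findOne: 2, findOneAsync: 1 }
```

The ratio of sync to async counts gives a rough picture of how much migration work remains.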
Publication Efficiency Analysis
Detect over-fetching and unbounded publications:
- Wraps `Meteor.publish` to intercept returned cursors
- Checks `_cursorDescription.options.fields` for missing projections (over-fetching flag)
- Tracks document counts per publication (average and max)
- Flags publications returning large result sets without limits
- Per-publication call counts and efficiency scores
Reported every 5 minutes by default.
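The projection and limit checks can be illustrated with plain objects. `analyzeCursor` and its result shape are hypothetical; the only detail taken from this section is that the projection is read from `_cursorDescription.options.fields`. Treating a missing `limit` as "unbounded" is an assumption.

```javascript
// Hypothetical check: flag cursors without a field projection or a limit.
function analyzeCursor(cursor) {
  const options =
    (cursor && cursor._cursorDescription && cursor._cursorDescription.options) || {};
  const fields = options.fields;
  const hasProjection = Boolean(fields) && Object.keys(fields).length > 0;
  const hasLimit = typeof options.limit === 'number' && options.limit > 0;
  return { overFetching: !hasProjection, unbounded: !hasLimit };
}

// A cursor-like object as a publication might return it.
const flags = analyzeCursor({
  _cursorDescription: { options: { fields: { name: 1 }, limit: 20 } },
});
// flags is { overFetching: false, unbounded: false }
```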
Environment Snapshots
Periodic capture of application environment metadata:
- Installed package versions from `process.versions` and `package.json`
- Node.js flags (`process.execArgv`)
- Environment variable keys (keys only, never values, for security)
- OS platform, release, CPU count, total memory
- Collected immediately on start, then refreshed periodically
Reported every 30 minutes by default.
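A snapshot of this kind can be assembled from Node.js built-ins. The sketch below is illustrative only: `captureEnvironment` and the field names are hypothetical, and reading `package.json` is omitted.

```javascript
const os = require('node:os');

// Hypothetical helper: capture environment metadata without leaking secrets.
function captureEnvironment() {
  return {
    versions: { ...process.versions }, // node, v8, openssl, ...
    execArgv: [...process.execArgv], // Node.js flags
    envKeys: Object.keys(process.env).sort(), // keys only, values are never read
    os: {
      platform: os.platform(),
      release: os.release(),
      cpuCount: os.cpus().length,
      totalMemory: os.totalmem(),
    },
  };
}
```

Because only `Object.keys(process.env)` is used, values such as API keys never enter the snapshot.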
Vulnerability Scanning
Automated security scanning for known package vulnerabilities:
- Runs `npm audit --json` on a configurable schedule
- Supports both npm audit v6 and v7+ JSON formats
- Only reports high and critical severity vulnerabilities to reduce noise
- Tracks: package name, severity, advisory title, fix availability
- Deduplicates results (skips reporting if unchanged since last scan)
- 30-second timeout on `npm audit` to prevent blocking
Reported every 1 hour by default. Initial scan delayed 60s after startup.
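Handling both report formats, filtering by severity, and skipping unchanged results could look roughly like this. `extractFindings` and `createReporter` are hypothetical names. The field names (`vulnerabilities`, `via`, `fixAvailable` for npm 7+; `advisories`, `module_name`, `patched_versions` for npm 6) reflect the `npm audit --json` output as commonly documented and should be treated as assumptions, as should reading `'<0.0.0'` as "no fix available".

```javascript
const REPORTED_SEVERITIES = new Set(['high', 'critical']);

// Hypothetical parser: normalize an `npm audit --json` report to a flat list.
function extractFindings(report) {
  const findings = [];
  if (report.vulnerabilities) {
    // npm 7+ shape: entries keyed by package name.
    for (const [name, v] of Object.entries(report.vulnerabilities)) {
      const advisory = (v.via || []).find((x) => typeof x === 'object');
      findings.push({
        package: name,
        severity: v.severity,
        title: advisory ? advisory.title : null,
        fixAvailable: Boolean(v.fixAvailable),
      });
    }
  } else if (report.advisories) {
    // npm 6 shape: entries keyed by advisory id.
    for (const a of Object.values(report.advisories)) {
      findings.push({
        package: a.module_name,
        severity: a.severity,
        title: a.title,
        fixAvailable: Boolean(a.patched_versions) && a.patched_versions !== '<0.0.0',
      });
    }
  }
  return findings.filter((f) => REPORTED_SEVERITIES.has(f.severity));
}

// Hypothetical deduplication: only send when the findings changed since the last scan.
function createReporter(send) {
  let lastFingerprint = null;
  return function report(findings) {
    const fingerprint = JSON.stringify(findings);
    if (fingerprint === lastFingerprint) return false;
    lastFingerprint = fingerprint;
    send(findings);
    return true;
  };
}
```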
Error Tracking
Automatic error capture on both server and client:
- Server-side errors with stack traces
- Client-side errors via `window.onerror` and `unhandledrejection` handlers
- Browser context (URL, user agent, viewport, user ID)
- Error grouping and fingerprinting
- Affected users and methods
- Build hash correlation for source maps
- Batched delivery to `/api/v1/errors` with public key authentication
Log Collection
Server-side log capture with structured metadata:
- Intercepts `console.log`, `console.info`, `console.warn`, `console.error`, `console.debug`
- Intercepts Meteor `Log.info`, `Log.warn`, `Log.error`, `Log.debug`
- Configurable log levels (default: info, warn, error, fatal)
- Sampling support for high-volume apps
- Message truncation to prevent oversized payloads
- Automatic host and timestamp enrichment
- Correlation with Meteor Method traces via `methodName` and `traceId`
- Programmatic log submission via `SkySignalAgent.addLog()`
Real User Monitoring (RUM) - Client-Side
Automatic browser-side performance monitoring collecting Core Web Vitals and providing PageSpeed-style performance warnings.
What Gets Collected
Core Web Vitals:
- LCP (Largest Contentful Paint) - Measures loading performance
  - Good: <2.5s | Needs Improvement: 2.5-4s | Poor: >4s
- FID (First Input Delay) - Measures interactivity
  - Good: <100ms | Needs Improvement: 100-300ms | Poor: >300ms
- CLS (Cumulative Layout Shift) - Measures visual stability
  - Good: <0.1 | Needs Improvement: 0.1-0.25 | Poor: >0.25
- TTFB (Time to First Byte) - Measures server response time
  - Good: <800ms | Needs Improvement: 800-1800ms | Poor: >1800ms
- FCP (First Contentful Paint) - Measures perceived load speed
  - Good: <1.8s | Needs Improvement: 1.8-3s | Poor: >3s
- TTI (Time to Interactive) - Measures time until page is fully interactive
  - Good: <3.8s | Needs Improvement: 3.8-7.3s | Poor: >7.3s
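The thresholds above translate directly into a rating function. In this sketch, `THRESHOLDS` and `rateMetric` are hypothetical names; timing values are in milliseconds and CLS is unitless. Values exactly on a boundary are treated as "needs-improvement", which is an assumption about how the ranges are meant to be read.

```javascript
// Thresholds copied from the list above (ms, except CLS).
const THRESHOLDS = {
  LCP: { good: 2500, poor: 4000 },
  FID: { good: 100, poor: 300 },
  CLS: { good: 0.1, poor: 0.25 },
  TTFB: { good: 800, poor: 1800 },
  FCP: { good: 1800, poor: 3000 },
  TTI: { good: 3800, poor: 7300 },
};

// Hypothetical helper: classify a measurement as good / needs-improvement / poor.
function rateMetric(name, value) {
  const t = THRESHOLDS[name];
  if (!t) return 'unknown';
  if (value < t.good) return 'good';
  if (value <= t.poor) return 'needs-improvement';
  return 'poor';
}

const lcpRating = rateMetric('LCP', 4200); // 'poor'
```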
Additional Context:
- Browser name and version
- Device type (mobile, tablet, desktop)
- Operating system
- Network connection type, downlink speed, RTT
- Viewport and screen dimensions
- User ID (via Meteor.userId() for correlation with server-side traces)
- Session ID (30-minute sessions with localStorage persistence)
- Page route and referrer
- Top 10 slowest resources
Configuration
RUM monitoring auto-initializes from your Meteor settings.
settings-development.json:
```json
{
  "skysignal": {
    "apiKey": "sk_your_server_api_key_here",
    "endpoint": "http://localhost:3000"
  },
  "public": {
    "skysignal": {
      "publicKey": "pk_your_public_key_here",
      "endpoint": "http://localhost:3000",
      "rum": {
        "enabled": true,
        "sampleRate": 1.0,
        "debug": false
      },
      "errorTracking": {
        "enabled": true,
        "captureUnhandledRejections": true,
        "debug": false
      }
    }
  }
}
```
Configuration Options:
| Option | Type | Default | Description |
|---|---|---|---|
publicKey | String | required | SkySignal Public Key (pk_ prefix) - Safe for client-side use |
endpoint | String | (same origin) | Base URL of SkySignal API (e.g., http://localhost:3000 or https://dash.skysignal.app) |
rum.enabled | Boolean | true | Enable/disable RUM collection |
rum.sampleRate | Number | Auto | Sample rate (0-1). Auto: 100% for localhost, 50% for production |
rum.debug | Boolean | false | Enable console logging for debugging |
errorTracking.enabled | Boolean | true | Enable client-side error capture via window.onerror and unhandledrejection |
errorTracking.captureUnhandledRejections | Boolean | true | Capture unhandled Promise rejections |
errorTracking.debug | Boolean | false | Log error tracker activity to the browser console |
Key Security Note:
- API Key (sk_ prefix): Server-side only, keep in private `settings.skysignal`. Used for server-to-server communication.
- Public Key (pk_ prefix): Client-side safe, can be in `settings.public.skysignal`. Used for browser RUM collection.
- This follows the Stripe pattern of separating public/private keys for security.
The agent automatically:
- Collects Core Web Vitals using Google's `web-vitals` library
- Tracks SPA route changes and collects metrics for each route
- Batches measurements and sends via fire-and-forget HTTP with `keepalive: true`
- Provides PageSpeed-style console warnings for poor performance
- Correlates metrics with server-side traces via Meteor.userId()
SPA Route Change Tracking
The RUM client automatically detects route changes in single-page applications by:
- Overriding `history.pushState` and `history.replaceState`
- Listening for `popstate` events (browser back/forward)
- Listening for `hashchange` events (hash-based routing)
Each route change triggers a new performance collection, allowing you to track performance across your entire application navigation flow.
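The three detection mechanisms can be sketched in a few lines. `trackRouteChanges` and `onRouteChange` are hypothetical names; the function takes the window object as a parameter so it can be exercised outside a browser.

```javascript
// Hypothetical sketch: notify a callback on every SPA navigation.
function trackRouteChanges(win, onRouteChange) {
  const { history } = win;
  for (const method of ['pushState', 'replaceState']) {
    const original = history[method];
    // Wrap the History API method: run the original, then report the change.
    history[method] = function (...args) {
      const result = original.apply(this, args);
      onRouteChange(method);
      return result;
    };
  }
  // Back/forward navigation and hash-based routing arrive as events.
  win.addEventListener('popstate', () => onRouteChange('popstate'));
  win.addEventListener('hashchange', () => onRouteChange('hashchange'));
}
```

In a browser this would be called once, for example as `trackRouteChanges(window, startNewCollection)`, where `startNewCollection` is whatever begins a fresh measurement.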
Performance Warnings
When Core Web Vitals exceed recommended thresholds, the RUM collector logs PageSpeed-style warnings to the console:
[SkySignal RUM] Largest Contentful Paint (LCP) is slow: 4200ms. LCP should be under 2.5s for good user experience. Consider optimizing images, removing render-blocking resources, and improving server response times.
These warnings help developers identify performance issues during development and testing.
Manual Usage (Advanced)
While RUM auto-initializes, you can also use it manually:
```javascript
import { SkySignalRUM } from 'meteor/skysignal:agent';

// Check if initialized
if (SkySignalRUM.isInitialized()) {
  // Get current session ID
  const sessionId = SkySignalRUM.getSessionId();

  // Get current metrics (for debugging)
  const metrics = SkySignalRUM.getMetrics();

  // Get performance warnings (for debugging)
  const warnings = SkySignalRUM.getWarnings();

  // Manually track a page view (for custom routing)
  SkySignalRUM.trackPageView('/custom-route');
}
```
How It Works
- Session Management - Creates a 30-minute session in localStorage, renews on user activity
- Core Web Vitals Collection - Uses Google's `web-vitals` library for accurate measurements
- Browser Context Collection - Detects browser, device, OS, network info from user agent and Navigator API
- Performance Warnings - Compares metrics against PageSpeed thresholds and logs warnings
- Batching - Batches measurements (default: 10 per batch, 5-second flush interval)
- HTTP Transmission - Sends to `/api/v1/rum` endpoint with `keepalive: true` for reliability
- SPA Detection - Automatically resets and re-collects metrics on route changes
Advanced Usage
Custom Metrics
Track business-specific KPIs and performance indicators with the custom metrics API:
Counter Metrics
Use counters for values that only increment (orders placed, emails sent, API calls):
```javascript
import { SkySignalAgent } from 'meteor/skysignal:agent';

// Simple counter increment
SkySignalAgent.counter('orders.completed');

// Counter with custom value and tags
SkySignalAgent.counter('items.sold', 5, {
  tags: { category: 'electronics', store: 'NYC' }
});

// Track API requests by endpoint
SkySignalAgent.counter('api.requests', 1, {
  tags: { endpoint: '/users', method: 'GET', status: '200' }
});
```
Timer Metrics
Use timers for measuring durations (API response times, job execution, processing time):
```javascript
// Track payment processing time
const paymentStart = Date.now();
await processPayment(order);
SkySignalAgent.timer('payment.processing', Date.now() - paymentStart, {
  tags: { provider: 'stripe', currency: 'USD' }
});

// Track external API call duration
// (separate variable name so both snippets can share one scope)
const apiStart = Date.now();
const result = await fetch('https://api.example.com/data');
SkySignalAgent.timer('external.api.call', Date.now() - apiStart, {
  tags: { service: 'example', endpoint: '/data', status: result.status }
});
```
Gauge Metrics
Use gauges for point-in-time values that go up or down (queue size, active users, inventory):
```javascript
// Track queue depth
const queueSize = await getQueueSize('email-queue');
SkySignalAgent.gauge('queue.size', queueSize, {
  unit: 'items',
  tags: { queue: 'email' }
});

// Track active users
const activeUsers = Meteor.server.sessions.size;
SkySignalAgent.gauge('users.active', activeUsers, {
  unit: 'users'
});

// Track inventory levels
SkySignalAgent.gauge('inventory.stock', 150, {
  unit: 'items',
  tags: { product: 'widget-123', warehouse: 'NYC' }
});
```
Generic trackMetric Method
For full control, use the generic trackMetric() method:
```javascript
SkySignalAgent.trackMetric({
  name: 'checkout.flow',
  type: 'counter', // 'counter' | 'timer' | 'gauge'
  value: 1,
  unit: 'conversions', // optional
  tags: { // optional - for filtering in dashboard
    product: 'premium',
    region: 'us-east-1'
  }
});
```
Manual Trace Submission
Track custom operations:
```javascript
const startTime = Date.now();

// Your code here...

SkySignalAgent.client.addTrace({
  traceType: 'method',
  methodName: 'myCustomOperation',
  timestamp: new Date(startTime),
  duration: Date.now() - startTime,
  userId: this.userId,
  operations: [
    { type: 'start', time: 0, details: {} },
    { type: 'db', time: 50, details: { collection: 'users', func: 'findOne' } },
    { type: 'complete', time: 150, details: {} }
  ]
});
```
Manual Log Submission
Send structured logs programmatically, bypassing console.* / Meteor Log.* interception:
```javascript
import { SkySignalAgent } from 'meteor/skysignal:agent';

// Simple log
SkySignalAgent.addLog('info', 'User signed up', { userId: 'abc123' });

// Error log with context
SkySignalAgent.addLog('error', 'Payment failed', {
  orderId: 'xyz-789',
  provider: 'stripe',
  errorCode: 'card_declined'
});

// Warning with structured metadata
SkySignalAgent.addLog('warn', 'Rate limit approaching', {
  endpoint: '/api/search',
  currentRate: 450,
  limit: 500
});
```
Log levels: debug, info, warn, error, fatal
Logs submitted via addLog() are tagged with source: "api" to distinguish them from auto-captured console/Meteor logs.
Stopping the Agent
To gracefully stop the agent (e.g., during shutdown):
```javascript
SkySignalAgent.stop();
```
This will:
- Stop all collectors
- Flush any remaining batched data
- Clear all intervals
Performance Impact
The agent is designed to have minimal performance impact on your application:
Built-in Optimizations
- Fire-and-forget batching - Data is batched and sent asynchronously using `setImmediate()` for lowest latency
- HTTP connection pooling - Reuses TCP connections with `keepAlive` to reduce handshake overhead
- Gzip compression - Large payloads (>1KB) are compressed before sending to reduce bandwidth
- Non-blocking collection - System metrics use async commands to avoid blocking the event loop
- Object pooling - HTTP request tracking reuses pre-allocated objects to reduce GC pressure
- Optimized URL matching - Combined regex patterns for O(1) exclude pattern matching
- Staggered startup - Collectors start with 500ms intervals to avoid CPU spikes at boot
- Configurable intervals - Adjust collection frequency based on your needs
- Automatic retries - Failed requests are re-queued with exponential backoff and jitter
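Exponential backoff with jitter, as mentioned in the last item, is commonly computed as below. This is a generic sketch and not the agent's code: `retryDelay`, the 1-second base, the 30-second cap, and the "full jitter" strategy (a uniform delay between 0 and the exponential ceiling) are all assumptions.

```javascript
// Hypothetical helper: delay before retry number `attempt` (0-based).
function retryDelay(attempt, { baseMs = 1000, maxMs = 30000, random = Math.random } = {}) {
  // The ceiling doubles with each attempt, up to maxMs.
  const ceiling = Math.min(maxMs, baseMs * 2 ** attempt);
  // Full jitter: pick a uniform delay in [0, ceiling).
  return Math.floor(random() * ceiling);
}

// Delays for the default maxBatchRetries of 3.
const delays = [0, 1, 2].map((attempt) => retryDelay(attempt));
```

The jitter spreads retries from many app instances over time so they do not hit the API at the same moment.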
Typical Overhead
- CPU: < 1% additional usage
- Memory: ~10-20MB for batching queues
- Network: ~1KB per metric (less with compression), sent in batches
- Event loop: < 1ms impact per collection cycle
Troubleshooting
Agent Not Sending Data
- Check that your API key is correct
- Verify `enabled: true` in configuration
- Check server logs for error messages
- Verify network connectivity to SkySignal API
High Memory Usage
If you notice high memory usage:
- Reduce `batchSize` to flush data more frequently
- Lengthen collection intervals so data is collected less often
- Disable collectors you don't need
Missing System Metrics
Some system metrics (disk, network, process count) require platform-specific APIs:
- Use the `systeminformation` npm package for comprehensive cross-platform metrics
- These metrics may return `null` on certain platforms
API Reference
SkySignalAgent
Main agent singleton instance.
Configuration Methods
- `configure(options)` - Configure the agent with options
- `start()` - Start all collectors and monitoring
- `stop()` - Stop all collectors and flush data
Custom Metrics Methods
| Method | Description |
|---|---|
counter(name, value?, options?) | Track incremental values (default value: 1) |
timer(name, duration, options?) | Track durations in milliseconds |
gauge(name, value, options?) | Track point-in-time values |
trackMetric(options) | Generic method with full control |
Log Methods
| Method | Description |
|---|---|
addLog(level, message, metadata?) | Submit a structured log entry. Level: debug, info, warn, error, fatal |
Options object (custom metrics methods):
- `tags` - Object with key-value pairs for filtering
- `unit` - Unit of measurement (e.g., 'ms', 'items', 'percent')
- `timestamp` - Optional Date (defaults to now)
Properties
- `client` - HTTP client instance for manual data submission
- `config` - Current configuration object
- `collectors` - Active collector instances
- `started` - Boolean indicating if agent is running
Support
Changelog
v1.0.24 (DDP Queue Stack Overflow Fix)
- Fix secondary stack overflow in wrapUnblock - When _recordBlockingTime() threw an error with a very deep stack trace (e.g. from mutual recursion across wrapper layers), console.error(error) triggered source-map-support's prepareStackTrace which re-overflowed, causing a secondary RangeError. All console.error calls in DDPQueueCollector now use String(error) to serialize the error message without triggering stack trace reprocessing. See #12.
- Move originalUnblock() call into finally block - The cleanup logic (delete self.currentProcessing[session.id] and originalUnblock()) was previously in the try block after the metrics recording. If console.error itself failed in the catch block, these cleanup steps were skipped, permanently stalling the DDP queue for that session. Both are now in a finally block to guarantee execution regardless of error handling failures.
- New test: deep-stack error handling - Added unit test verifying that wrapUnblock correctly invokes the original unblock function even when the metrics callback throws an error with a deeply nested stack trace.
- Fix root cause: async originalUnblock() via queueMicrotask - The primary stack overflow (#7) was caused by originalUnblock() being called synchronously in the finally block. When another APM agent (e.g. montiapm:agent) also wraps the DDP session's unblock function, the synchronous call chain creates infinite recursion. originalUnblock() is now called via queueMicrotask() to break the synchronous chain. This is safe because Meteor's own runHandlers() independently calls its native unblock after the protocol handler returns; our call is purely for instrumentation.
- Conflicting APM agent detection - On startup, DDPQueueCollector now checks for montiapm:agent and mdg:meteor-apm-agent and logs a warning if detected. Running multiple APM agents that wrap DDP internals simultaneously is not recommended.
- New test: synchronous recursion prevention - Added cross-collector test simulating 200 chained unblock calls from another APM agent, verifying that queueMicrotask breaks the recursion without stack overflow.
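The v1.0.24 notes above combine three ideas: a once-only guard, String(error) logging, and a deferred unblock inside finally. The following is a minimal sketch of how they fit together, not the agent's actual source. The names wrapUnblock and recordBlockingTime are reused only for illustration, and the function signature is an assumption.

```javascript
// Illustrative sketch: the signature and internals are assumptions,
// not the real DDPQueueCollector implementation.
function wrapUnblock(originalUnblock, recordBlockingTime) {
  let unblocked = false; // guard so the body runs at most once

  return function wrappedUnblock() {
    if (unblocked) return;
    unblocked = true;

    try {
      recordBlockingTime(); // metrics are best-effort
    } catch (error) {
      // String(error) produces "Name: message" without reading error.stack,
      // so no stack-trace post-processing is triggered.
      console.error('[sketch] metrics failed:', String(error));
    } finally {
      // Deferring the call breaks the synchronous chain when several
      // wrappers are stacked on the same unblock function.
      queueMicrotask(originalUnblock);
    }
  };
}
```

Because the real unblock runs in a microtask, callers observe it only after the current synchronous call stack has unwound.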
v1.0.23 (BullMQ Support & Job Package Tracking)
- BullMQ queue monitoring - The agent now supports BullMQ as a second job queue backend alongside msavin:sjobs. BullMQMonitor discovers queues automatically by scanning Redis for bull:*:meta key patterns and attaches QueueEvents listeners for real-time job lifecycle tracking (active, completed, failed, stalled, progress). Supports manual queue configuration via bullmqQueues for non-standard Redis key prefixes. Includes an LRU job detail cache (jobCacheMaxSize: 2000, jobCacheTTL: 120000) to fetch full job data on failure without hitting Redis on every event.
- Trace correlation for BullMQ - Wraps Queue.add() and Queue.addBulk() to inject __skysignal_traceId into job data, linking each BullMQ job back to the originating Meteor Method trace. On job completion/failure, the trace ID is extracted from the job payload and attached to the job record, enabling end-to-end visibility from Method call through queue execution.
- Multi-package auto-detection - The JobCollector factory now checks for both msavin:sjobs and bullmq at startup. If both packages are installed, the agent monitors both simultaneously and tags each job with its originating package. Use jobsPackage in settings to force a specific package if needed.
- jobsPackage field in job event payloads - BaseJobMonitor._sendJobEvent() now injects jobsPackage: this.getPackageName() into every outbound job event. This is the single choke point for all job data, so both SteveJobsMonitor and BullMQMonitor events are tagged automatically without subclass changes.
- Platform: jobsPackage schema field and index - Added jobsPackage (optional String) to the BackgroundJobs collection schema and a new compound index { customerId, siteId, jobsPackage, queuedAt } for efficient package-filtered queries.
- Platform: Package-aware query methods - All job query service methods (getMetrics, getQueueStats, queryJobs, getJobTypePerformance, getLatencyDistribution, getFailureRateTrend) now accept an optional jobsPackage filter parameter. Added getJobsPackages() to return distinct packages for a site.
- Platform: Jobs tab package filter - When a site has jobs from multiple packages, the Jobs tab shows a package filter dropdown (Autocomplete) alongside the existing queue filter. All sub-tabs (Running, Failed, Scheduled, Recent Jobs, Performance, Analytics) respect the selected package filter. Job rows display a small package Chip next to the job type when multiple packages are present.
- BullMQ configuration options - bullmqRedis (connection object), bullmqQueues (manual queue list), detailedTracking (fetch full job details on failure), jobCacheMaxSize, and jobCacheTTL. See documentation for full reference.
- Backward compatible - jobsPackage is optional in the schema and defaults to null. Pre-1.0.23 agents continue to work; the UI hides the package filter when only one (or zero) packages are present.
v1.0.22 (Graceful Shutdown & Stale Job Fixes)
- Graceful shutdown on SIGTERM/SIGINT - The agent now registers process.once("SIGTERM") and process.once("SIGINT") handlers during auto-start. When the host app shuts down (e.g., new deployment on Galaxy), the agent stops all collectors, flushes all pending telemetry batches, and logs a shutdown message. Previously, deploys killed the agent mid-flight and all buffered data was silently dropped.
- Fix client.stop() dropping final flush - SkySignalClient.stop() previously set this.stopped = true before calling this.flush(), causing _sendBatch() to check the flag and silently discard every pending batch. The flag is now set after the final flush so all buffered data is actually sent.
- Fix agent.stop() not fully cleaning up - agent.stop() previously called client.flush() (fire-and-forget, no timer cleanup) instead of client.stop() (clears auto-flush timer, clears retry timers, performs final flush, then sets stopped flag). Timers and retries now properly stop on shutdown.
- Fix _sendBatch inner stopped check blocking final flush - Removed redundant this.stopped guard inside the setImmediate callback in _sendBatch(). The outer guard before setImmediate is sufficient, and the inner check was racing with stop() to block final-flush HTTP requests that had already been dispatched.
- Fix Steve Jobs observer race condition (Meteor 3.x) - SteveJobsMonitor.setupHooks() now awaits cursor.observe(), which returns a Promise in Meteor 3.x. Previously the observer was set up asynchronously while _scanExistingJobs() ran synchronously, creating a window where jobs could complete before the observer was ready. Completion events in that window were missed, leaving jobs permanently stuck as "running" on the server.
- Fix orphaned scheduled/replicated jobs - _handleJobRemoved() now handles docs removed with state: "pending" by emitting start + complete events. Steve Jobs calls instance.remove() without setting state to "success", so replicated jobs that were only tracked as "pending" previously had no completion event and stayed orphaned on the server forever.
- Remove redundant _scanExistingJobs() - The awaited observer's initial added callbacks already cover all existing documents, making the separate synchronous scan both redundant and racy.
- Fix observer cleanup - cleanupHooks() is now async and checks for .stop() on the resolved observer handle, instead of calling .stop() on the unresolved Promise that was stored before this fix.
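The ordering bug in client.stop() described above is easy to reproduce in miniature. The sketch below uses a hypothetical BatchClient class with an injected send function; it is not SkySignalClient. It shows only the corrected order: clear timers, run the final flush, then set the stopped flag.

```javascript
// Hypothetical stand-in for a batching client; names are illustrative.
class BatchClient {
  constructor(send) {
    this.send = send;     // assumed transport: (batch) => Promise
    this.queue = [];
    this.stopped = false;
    this.timer = null;
  }

  start(intervalMs) {
    this.timer = setInterval(() => this.flush(), intervalMs);
  }

  add(item) {
    if (!this.stopped) this.queue.push(item);
  }

  async flush() {
    // If stopped were already true here, pending data would be dropped.
    if (this.stopped || this.queue.length === 0) return;
    const batch = this.queue;
    this.queue = [];
    await this.send(batch);
  }

  async stop() {
    clearInterval(this.timer); // no more scheduled flushes
    this.timer = null;
    await this.flush();        // final flush while stopped is still false
    this.stopped = true;       // only now refuse further work
  }
}
```

Swapping the last two statements of stop() reproduces the original symptom: flush() sees the flag and returns without sending anything.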
v1.0.21 (Nested Cgroup Fix & Uptime Metric)
- Fix container metrics on Galaxy and nested cgroup hierarchies - The cgroup detection in v1.0.18/v1.0.20 hardcoded root paths (/sys/fs/cgroup/memory.max, /sys/fs/cgroup/cpu.max). On Galaxy and other platforms that use nested cgroup hierarchies (e.g., /sys/fs/cgroup/kubepods.slice/kubepods-pod123.slice/...), the root files return the parent slice limit (often 512 MB or unlimited) instead of the per-container limit (e.g., 2 GB on Galaxy "Double" plan). This caused SkySignal to report 512 MB / 92% (Critical) when the real container had 2 GB at ~23% usage. Added _getCgroupBase() which parses /proc/self/cgroup to resolve the actual cgroup path for the current process, handling both cgroup v2 (0::/ lines) and cgroup v1 (:memory: controller lines). All four cgroup detection methods (_detectMemoryLimit, _detectCpuQuota, _getContainerMemoryUsage, _detectCgroupMemUsagePath) now try the resolved nested path first, then fall back to root paths for simple container setups. This is the same technique used by cAdvisor, Kubernetes metrics-server, and Galaxy's own dashboard.
- New uptime metric field - Now collects process.uptime() (seconds since the Node.js process started) each collection cycle. Previously the System tab showed "Uptime: 0m" because this field was never sent by the agent.
- process.constrainedMemory() safety check - Added limit < Number.MAX_SAFE_INTEGER guard to the Node 19+ constrainedMemory() strategy, preventing false positives when the function returns a sentinel value indicating no cgroup limit.
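To make the nested-path resolution concrete, here is a rough sketch of parsing /proc/self/cgroup content and building an ordered list of candidate limit files. The function names parseCgroupPath and memoryLimitCandidates are hypothetical, and preferring the v2 line when both formats appear is a simplifying assumption, not necessarily what _getCgroupBase() does.

```javascript
const { posix } = require('node:path');

// Each line of /proc/self/cgroup is "hierarchy-ID:controller-list:cgroup-path".
function parseCgroupPath(procSelfCgroup, controller = 'memory') {
  let v1Path = null;
  for (const line of procSelfCgroup.split('\n')) {
    const first = line.indexOf(':');
    const second = line.indexOf(':', first + 1);
    if (first === -1 || second === -1) continue;

    const id = line.slice(0, first);
    const controllers = line.slice(first + 1, second);
    const cgroupPath = line.slice(second + 1).trim();

    if (id === '0' && controllers === '') {
      return { version: 2, path: cgroupPath }; // cgroup v2 unified hierarchy
    }
    if (controllers.split(',').includes(controller)) {
      v1Path = cgroupPath; // cgroup v1 line for the requested controller
    }
  }
  return v1Path === null ? null : { version: 1, path: v1Path };
}

// Nested path first, then the root files as a fallback.
function memoryLimitCandidates(procSelfCgroup) {
  const resolved = parseCgroupPath(procSelfCgroup, 'memory');
  const files = [];
  if (resolved && resolved.version === 2) {
    files.push(posix.join('/sys/fs/cgroup', resolved.path, 'memory.max'));
  } else if (resolved) {
    files.push(posix.join('/sys/fs/cgroup/memory', resolved.path, 'memory.limit_in_bytes'));
  }
  files.push('/sys/fs/cgroup/memory.max', '/sys/fs/cgroup/memory/memory.limit_in_bytes');
  return [...new Set(files)]; // drop duplicates when the path is "/"
}
```

A caller would read /proc/self/cgroup once at startup, then try each candidate file in order until one can be read.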
v1.0.20 (Publication Context & Observer Leak Detection & Container-Aware Metrics)
- Container memory detection - When the agent runs inside a Docker container (e.g., Meteor Galaxy), os.totalmem()/os.freemem() report host machine values, not container limits. The agent now detects cgroup memory limits via a 3-strategy fallback: process.constrainedMemory() (Node 19+), cgroup v2 (/sys/fs/cgroup/memory.max), cgroup v1 (/sys/fs/cgroup/memory/memory.limit_in_bytes). When a limit is found, memoryTotal, memoryUsed, memoryFree, and memoryUsage report container-level values instead of host-level values.
- Container CPU quota detection - Reads CPU quota from cgroup v2 (/sys/fs/cgroup/cpu.max) or cgroup v1 (cpu.cfs_quota_us / cpu.cfs_period_us). When a quota is set, cpuCores reports the effective container CPU count (e.g., 2.0 for a 200% quota) and process-level CPU % normalizes against the container quota, not host cores.
- Container memory usage per-cycle - Reads current memory usage each collection cycle via process.availableMemory() (Node 19+), cgroup v2 (memory.current), or cgroup v1 (memory.usage_in_bytes), with a heapUsed fallback.
- New metric fields - isContainerized (Boolean) indicates container detection; hostMemoryTotal (Number) preserves the original os.totalmem() value for diagnostics when containerized.
- Non-containerized environments unchanged - All metrics remain identical when no cgroup limits are detected (local dev, bare-metal servers).
- Publication context propagation via AsyncLocalStorage - PublicationTracer now wraps Meteor.publish handlers in AsyncLocalStorage.run(), setting publicationName, connectionId, and isAutoPublish in the async context. LiveQueriesCollector reads this context in its _observeChanges wrapper via publicationContextStore.getStore() (O(1)), so every observer created inside a publication handler now carries the publication name and DDP connection ID that owns it. This enables the platform's enhanced leak detection to distinguish auto-publish observers from real subscription leaks.
- Auto-publish detection - Unnamed/null publications (auto-publish patterns) are wrapped with isAutoPublish: true context. Observers created by these publications are tagged accordingly, allowing the platform to apply 3x longer thresholds (72h vs 24h) before flagging them as leaked.
- New observer payload fields - isAutoPublish (Boolean) and connectionId (String) are now included in every observer record sent to the platform. Both fields are backward-compatible (default false/null for pre-1.0.20 agents).
v1.0.19 (Bug Fixes & Code Quality)
- Fix LiveQueriesCollector showing 0 observers and 0 metrics - Completely rewrote observer interception. The previous approach wrapped Mongo.Collection.prototype.find to patch cursor.observe/cursor.observeChanges on each returned cursor instance, but this failed because: (1) in Meteor 3.x, observeChanges is async (returns a Promise via MongoConnection._observeChanges) but the wrapper treated it synchronously; (2) the per-cursor instance patching was fragile. Now hooks directly into MongoInternals.Connection.prototype._observeChanges, the single async bottleneck ALL server-side observers funnel through. Uses a two-phase tracking approach: _createObserverData() creates a provisional observer record BEFORE calling the original _observeChanges, so that wrapped added/changed/removed callbacks can count initial documents arriving during the await. _finalizeObserver() then links the handle's multiplexer and driver type after the await. Deduplicates by multiplexer identity (not query hash) so observers sharing a Meteor ObserveMultiplexer are counted as one server-side resource with multiple handlers. Falls back to Collection.prototype.find wrapping when MongoInternals is unavailable.
- Fix LiveQueriesCollector config missing from DEFAULT_CONFIG - collectLiveQueries, liveQueriesInterval, and liveQueriesPerformanceThresholds were defined in the SkySignalAgent constructor but missing from config.js DEFAULT_CONFIG. Since mergeConfig() spreads DEFAULT_CONFIG first, these values were always overwritten to undefined, silently disabling live query collection.
- Fix container memory usage reporting >100% - SystemMetricsCollector previously calculated container memory as processMemory.rss / constrainedMemory * 100, which could exceed 100% because RSS (Resident Set Size) includes shared library pages, memory-mapped files, and kernel page cache that don't count against the container's cgroup memory limit. Now uses process.availableMemory() (Node 19+), which reads directly from the cgroup memory controller and accounts for reclaimable buffers, to compute usage as (constrainedMemory - availableMemory) / constrainedMemory * 100. Falls back to heapUsed / constrainedMemory * 100 on older Node versions. This aligns reported memory with what container orchestrators (e.g., Meteor Galaxy) actually report.
- Fix observer stop logging crash - LiveQueriesCollector._wrapHandle() used this._log() inside a regular function() callback where this referred to the handle object, not the collector instance. Changed to self._log() to use the captured closure variable. Previously, calling handle.stop() would throw TypeError: this._log is not a function, silently preventing observer lifecycle metrics from being recorded.
- Fix P95 percentile off-by-one - MongoPoolCollector, DnsTimingCollector, and DiagnosticsChannelCollector all used Math.floor(count * 0.95) to index into a sorted array, which overshoots the true 95th percentile by one position (e.g., for 100 items, returns the 96th element instead of the 95th). Changed to Math.ceil(count * p) - 1 across all three collectors. Extracted to shared percentile() utility in lib/utils/percentile.js.
- Fix MongoPoolCollector.stop() killing other event listeners - stop() called client.removeAllListeners(eventName) for each pool event, which removed ALL listeners for that event, including those registered by the application or other collectors. Now stores individual handler references in start() and calls client.removeListener(eventName, handler) in stop() to remove only the collector's own handlers.
- Fix circular buffer read after wrap-around - MongoPoolCollector._calculateCheckoutMetrics() used checkoutSamples.slice(0, count) to extract samples, which returns incorrect data after the circular buffer wraps (old data mixed with new). Now correctly reads from the current write index forward using modular arithmetic to reconstruct the proper time-ordered sequence.
- Shared percentile utility - Extracted percentile calculation to lib/utils/percentile.js with percentile(sorted, p) and percentiles(values) functions, replacing duplicated math in MongoPoolCollector, DnsTimingCollector, and DiagnosticsChannelCollector.
- Shared buffer eviction utility - Extracted array trimming to lib/utils/buffer.js with trimToMaxSize(array, maxSize), replacing duplicated splice(0, length - max) patterns in DnsTimingCollector, DiagnosticsChannelCollector, and MongoPoolCollector._recordPoolWaitTime.
- Leak-detection field tests - Added 19 unit tests (LiveQueriesCollector.leakFields.test.js) verifying the collector produces correct values for fields used by the server-side ObserverLeakDetectionService: _wrapCallbacks correctly increments liveUpdateCount and lastActivityAt only after initial load completes (not during the initial document fetch), _wrapHandle calculates observerLifespan in seconds on stop, and _createObserverData initializes all leak-relevant fields to safe defaults. These tests ensure the agent emits the data contract that leak detection heuristics (inactive observers, long-lived stale observers, orphaned observers) depend on.
- Remove stale _generateQuerySignature tests - Deleted 5 tests for _generateQuerySignature in LiveQueriesCollector.test.js that were left behind when the method was removed during the v1.0.19 observer interception rewrite. These tests were failing with TypeError: collector._generateQuerySignature is not a function.
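The P95 off-by-one above is easiest to see with numbers. The sketch below follows the percentile(sorted, p) signature named in the changelog, but the body is an assumption (nearest-rank with clamping), not the contents of lib/utils/percentile.js.

```javascript
// Nearest-rank percentile over an ascending-sorted array.
// The empty-array result of 0 is an arbitrary choice made for this sketch.
function percentile(sorted, p) {
  if (sorted.length === 0) return 0;
  const rank = Math.ceil(sorted.length * p) - 1; // zero-based index
  const index = Math.min(sorted.length - 1, Math.max(0, rank));
  return sorted[index];
}

// Values 1..100 make the element position equal to the value.
const samples = Array.from({ length: 100 }, (_, i) => i + 1);

const fixedP95 = percentile(samples, 0.95);                   // 95th element
const oldP95 = samples[Math.floor(samples.length * 0.95)];    // 96th element
```

With 100 samples the corrected index is 94 (the 95th element), while the Math.floor form lands on index 95 (the 96th element).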
v1.0.18 (Container-Aware Metrics)
- Container-aware memory usage - SystemMetricsCollector now uses process.constrainedMemory() (Node 19+) to detect cgroup memory limits in containerized deployments. When a cgroup limit is present, memory usage is calculated as processMemory.rss / constrainedMemory * 100 instead of (os.totalmem() - os.freemem()) / os.totalmem() * 100. The OS-level calculation counts kernel buffer/cache as "used", which dramatically overstates actual memory pressure in containers (e.g., reporting 89% when real RSS usage is 27%).
- Process-level CPU measurement - Replaced OS-level idle-time CPU calculation with process.cpuUsage() delta tracking. The previous approach measured total system CPU across all processes, which is misleading for a single Node.js application in a shared or containerized environment. Now tracks user + system CPU microseconds between collection intervals, divided by available parallelism, to report the actual CPU consumed by the monitored Meteor process.
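The process.cpuUsage() delta tracking described above can be approximated as follows. This is a sketch under assumptions: createCpuSampler is a hypothetical name, and normalizing by core count with a fallback to os.cpus().length is one reasonable reading of "divided by available parallelism".

```javascript
const os = require('node:os');

function availableCores() {
  // os.availableParallelism() exists on Node 18.14+; fall back otherwise.
  return typeof os.availableParallelism === 'function'
    ? os.availableParallelism()
    : Math.max(1, os.cpus().length);
}

function createCpuSampler(cores = availableCores()) {
  let lastUsage = process.cpuUsage();     // cumulative { user, system } in microseconds
  let lastTime = process.hrtime.bigint(); // monotonic clock in nanoseconds

  return function sample() {
    const usage = process.cpuUsage();
    const now = process.hrtime.bigint();

    const cpuMicros =
      (usage.user - lastUsage.user) + (usage.system - lastUsage.system);
    const wallMicros = Number(now - lastTime) / 1000;

    lastUsage = usage;
    lastTime = now;

    if (wallMicros <= 0) return 0;
    // Percentage of total capacity across all available cores.
    return (cpuMicros / (wallMicros * cores)) * 100;
  };
}
```

Calling the returned sample() once per collection interval yields the CPU share consumed by this process since the previous call.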
v1.0.17 (Bug Fixes, Performance & Testing)
- DDP queue unblock recursion fix - Restructured DDPQueueCollector.wrapUnblock() to eliminate a remaining infinite recursion path. The catch block previously retried calling originalUnblock() after a failure, but originalUnblock can itself be a wrapper from another layer (e.g., MethodTracer). If that wrapper threw, the retry would re-enter it, creating unbounded mutual recursion and RangeError: Maximum call stack size exceeded. The fix sets the unblocked guard immediately on entry, isolates metrics collection in its own try/catch so failures are non-fatal, and calls originalUnblock() exactly once with no retry. (fixes #7)
- Console error object serialization - ErrorTracker now properly serializes object arguments passed to console.error() using JSON.stringify instead of String(). Previously, console.error('test', {a:1, b:2}) would be captured as "test [object Object]"; it now correctly captures "test {"a":1,"b":2}". The same fix applies to UnhandledRejection events where the rejection reason is a plain object rather than an Error instance. Serialization is depth-limited (5 levels) and size-capped (5KB) to prevent oversized payloads from deeply nested objects. Circular references are detected and replaced with "[Circular]". (fixes #10)
- Use os.availableParallelism() for CPU count - Replaced os.cpus().length with os.availableParallelism() in SystemMetricsCollector and EnvironmentCollector. The Node.js docs advise against using os.cpus().length to determine available parallelism, as it can return an empty array on some systems. os.availableParallelism() (Node 18.14+) is the recommended API for this purpose.
- Replace sync FS calls with async - Converted all readFileSync, readdirSync, and existsSync calls to their async equivalents (fs/promises) in SystemMetricsCollector, EnvironmentCollector, and VulnerabilityCollector. These ran in background collectors on periodic intervals but still blocked the event loop unnecessarily. Now uses fs.readFile, fs.readdir, fs.access to avoid blocking the host application's event loop.
- Guard SkySignalClient console output behind debug flag - All console.error and console.warn calls in SkySignalClient (serialization failures, network errors, timeouts, retry queue overflow, dropped batches) are now gated behind a debug option. Previously these logged unconditionally, which could be noisy in production. The abort-error log (fix #4) was already guarded; this change applies the same pattern to all 11 remaining log sites.
- Fix MethodTracer result truncation - MethodTracer attempted to JSON.parse() a truncated JSON string when serializing large method results (>500 chars). Slicing a JSON string mid-token always produces invalid JSON, so the parse always threw and the result was silently replaced with '<unable to serialize>'. Now stores the truncated string directly instead of attempting to round-trip it through JSON.parse.
- Fix VulnerabilityCollector timer leak on early stop - VulnerabilityCollector.start() used a 60-second setTimeout for the initial collect delay but did not store the timer ID. If stop() was called within the first 60 seconds, the delayed collect would still fire. Now stores the timer ID in _delayTimerId and clears it in stop().
- Eliminate JSON.parse/stringify from DDP hot path - DDPCollector previously called JSON.parse() on every outgoing DDP message (just to read the msg field) and JSON.stringify() on every incoming and outgoing message (just for byte-size estimation). On a busy app with 100+ subscriptions pushing frequent updates, this added thousands of serialize/deserialize cycles per second. Replaced with extractMsgType() (substring extraction) for message type detection and estimateMsgSize() (shallow key walk) for size estimation. JSON.parse is now only used for the small subset of messages that require structured data (subscription lifecycle events).
- Replace new Date() with Date.now() in all hot paths - DDPCollector, LiveQueriesCollector, and MethodTracer created new Date() objects in per-message and per-observer-callback paths. Each allocation adds GC pressure. Replaced with Date.now() (returns a number with zero heap allocation) across all subscription tracking, observer callback wrappers, and method context creation. Timestamps are converted to Date objects only at serialization time.
- Reduce MethodTracer per-method allocation overhead - Every method invocation allocated a new Map() for query fingerprints, empty arrays for operations and slow queries, and generated a trace ID via Math.random().toString(36). Changed to: counter-based trace IDs (no string conversion), lazy-initialized Map and arrays (only allocated when the method actually performs database operations). Methods that don't touch the database (pure computation, DDP calls) now allocate significantly less.
- Optimize SkySignalClient flush path - _safeStringify now tries fast JSON.stringify() first (no WeakSet, no replacer function overhead) and only falls back to circular-reference-safe serialization if the fast path throws. Hoisted _getEndpointForBatchType and _getPayloadKey lookup tables to module-level constants (avoids creating new object literals on every batch send). Replaced [...batch] spread copy with reference swap in _sendBatch.
- Fix buffer eviction patterns - DnsTimingCollector and DiagnosticsChannelCollector used Array.slice(-max) to evict old samples, which allocated a new array on every eviction. Replaced with in-place splice(). MongoPoolCollector._recordPoolWaitTime used Array.shift() (O(n) at 1000 elements) on every new sample; replaced with batch eviction via splice() that triggers less frequently. DDPQueueCollector used Object.keys().length to check cache size on every insert; replaced with an O(1) counter.
- 963 unit tests with GitHub Actions CI - Added a comprehensive standalone test suite (Mocha + Chai + Sinon) covering all collectors, client modules, and library utilities. Includes regression tests for bugs #7 and #10. Tests run via npm test without requiring a Meteor environment. Added GitHub Actions workflow (.github/workflows/test.yml) to run tests on push/PR against Node.js 20 and 22.
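The "fast path first, safe path on failure" serialization mentioned in the flush-path item above might look roughly like this. It is a sketch, not _safeStringify itself: the ancestor-tracking replacer follows the pattern documented on MDN for JSON.stringify, and the BigInt handling is an extra assumption added so the fallback cannot throw for that reason.

```javascript
function safeStringify(value) {
  try {
    // Fast path: no replacer, no bookkeeping.
    return JSON.stringify(value);
  } catch (error) {
    // Slow path: only taken when the fast path throws
    // (for example on a circular structure).
    const ancestors = [];
    return JSON.stringify(value, function replacer(key, current) {
      if (typeof current === 'bigint') return current.toString();
      if (typeof current !== 'object' || current === null) return current;

      // `this` is the object holding `key`; unwind to it so only true
      // ancestors remain on the stack (shared siblings are not flagged).
      while (ancestors.length > 0 && ancestors[ancestors.length - 1] !== this) {
        ancestors.pop();
      }
      if (ancestors.includes(current)) return '[Circular]';
      ancestors.push(current);
      return current;
    });
  }
}
```

The common case pays only for a plain JSON.stringify call; the replacer cost is incurred just for payloads that actually fail.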
v1.0.16 (Bug Fixes)
- DDP queue infinite recursion fix - Removed finally block in DDPQueueCollector._hijackMethodHandler that unconditionally called unblock() after every method invocation. When sessions were wrapped more than once (e.g., agent stop/restart during hot reload), the stacked finally blocks triggered cross-layer recursion through the original unblock reference, causing RangeError: Maximum call stack size exceeded. Added a _skySignalDDPQueueWrapped sentinel to prevent double-wrapping sessions entirely. (fixes #5)
- Stale keepAlive socket fix - Added freeSocketTimeout: 15000 to both HTTP and HTTPS agents used by SkySignalClient. Previously, idle keepAlive sockets could sit in the pool indefinitely; when the server closed its end, the next request reusing the stale socket would get an AbortError. The subscriptions batch type was disproportionately affected due to its longer flush cadence. Abort errors are now downgraded to debug-only console.warn since the retry logic already handles them transparently. (fixes #4)
- Screenshot capture import fix - ScreenshotCapture now imports html2canvas as an ES module instead of checking for a global variable. Since html2canvas is already declared in Npm.depends() in package.js, Meteor bundles it automatically, so host applications no longer need to install it as a separate dependency. (fixes #3)
v1.0.15 (New Features)
7 new collectors, enhanced system metrics, COLLSCAN detection, sendBeacon transport, and worker thread offloading.
New Collectors
- DNS Timing (DnsTimingCollector) - Wraps dns.lookup and dns.resolve to measure DNS resolution latency. Tracks per-hostname timing, P95/max latency, and failure counts. Identifies slow resolvers in Docker/K8s environments.
- Outbound HTTP (DiagnosticsChannelCollector) - Uses Node.js diagnostics_channel API (Node 16+) to instrument outbound HTTP/HTTPS requests without monkey-patching. Captures timing breakdown (DNS, connect, TLS, TTFB), status codes, and error rates for external dependencies.
- CPU Profiling (CpuProfiler) - On-demand CPU profiling via the built-in inspector module. Automatically triggers when CPU exceeds a configurable threshold (default: 80%), captures a 10-second profile, and sends a summary of top functions by self-time. Configurable cooldown prevents over-profiling.
- Deprecated API Detection (DeprecatedApiCollector) - Wraps Mongo.Collection prototype methods and Meteor.call to count sync vs async invocations. Tracks find().fetch() vs fetchAsync(), findOne() vs findOneAsync(), insert/update/remove vs async variants. Helps measure Meteor 3.x migration readiness.
- Publication Efficiency (PublicationTracer) - Wraps Meteor.publish to intercept returned cursors. Detects publications missing field projections (over-fetching) and those returning large document sets without limits. Reports per-publication call counts, document averages, and efficiency scores.
- Environment Snapshots (EnvironmentCollector) - Captures installed package versions (process.versions + package.json), Node.js flags, environment variable keys (not values), and OS metadata. Collected immediately on start, then refreshed every 30 minutes.
- Vulnerability Scanning (VulnerabilityCollector) - Runs npm audit --json hourly (with 30s timeout). Parses both v6 and v7+ audit formats. Reports high/critical vulnerabilities with package name, severity, advisory title, and fix availability. Deduplicates unchanged results.
Enhanced System Metrics
- Event Loop Utilization (ELU) - 0-1 ratio of event loop busyness via performance.eventLoopUtilization() (Node 14.10+)
- V8 Heap Statistics - Per-heap-space breakdown (new_space, old_space, code_space, etc.) via v8.getHeapStatistics() and v8.getHeapSpaceStatistics(). Includes native context count and detached context leak detection.
- Process Resource Usage - User/system CPU time, voluntary/involuntary context switches, filesystem reads/writes via process.resourceUsage()
- Active Resources - Handle/request counts by type (Timer, TCPWrap, FSReqCallback, etc.) via process.getActiveResourcesInfo() (Node 17+) for resource leak detection
- Container Memory Limit - cgroup memory constraint via process.constrainedMemory() (Node 19+) for containerized deployments
- Agent Version - agentVersion field added to every system metrics payload for compatibility tracking
Method Tracer Enhancements
- COLLSCAN flagging - Slow queries are now flagged with collscan: true when explain() data indicates a full collection scan (no index used, or totalDocsExamined > 0 with totalKeysExamined === 0). Applied both at initial detection time and retroactively after async explain completes.
- Slow aggregation pipeline capture - Slow aggregation operations now include the sanitized pipeline stages in the slow query entry for debugging.
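As a rough illustration of the COLLSCAN rule above, the helper below inspects an explain() result shaped like MongoDB's executionStats output. The function names are hypothetical, and walking inputStage/inputStages is an assumption about how "no index used" could be detected.

```javascript
// Recursively look for a COLLSCAN stage in a winning plan tree.
function planHasCollscan(plan) {
  if (!plan || typeof plan !== 'object') return false;
  if (plan.stage === 'COLLSCAN') return true;
  const children = [plan.inputStage, ...(plan.inputStages || [])];
  return children.some((child) => planHasCollscan(child));
}

// True when the plan scans the collection, or when documents were
// examined without any index keys being examined.
function isCollscan(explain) {
  const stats = (explain && explain.executionStats) || {};
  const winningPlan =
    explain && explain.queryPlanner && explain.queryPlanner.winningPlan;
  const examinedWithoutIndex =
    stats.totalDocsExamined > 0 && stats.totalKeysExamined === 0;
  return planHasCollscan(winningPlan) || examinedWithoutIndex;
}
```

A tracer could set collscan: true on a slow query entry whenever isCollscan returns true for its explain output.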
Client-Side Transport Improvements
- sendBeacon primary transport - ErrorTracker and RUMClient now use navigator.sendBeacon() as the primary transport for small payloads (<60KB for errors, all RUM batches). This is truly fire-and-forget with zero async overhead: no promises, no callbacks, no event loop work. Falls back to fetch with keepalive for large payloads or when sendBeacon returns false.
- Public key via query param - sendBeacon cannot set custom headers, so the public key is passed as ?pk= query parameter (lazily cached URL). The X-SkySignal-Public-Key header is still sent on fetch fallback for backward compatibility.
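The transport selection above can be sketched as a small function. Everything here other than navigator.sendBeacon, fetch with keepalive, the ?pk= parameter, and the X-SkySignal-Public-Key header is an assumption: the name sendTelemetry, the injectable env parameter (added so the logic can run outside a browser), and the character-count approximation of the 60KB limit are illustrative only. It also assumes the URL has no existing query string.

```javascript
const MAX_BEACON_CHARS = 60 * 1024; // rough stand-in for the <60KB limit

function sendTelemetry(url, publicKey, payload, env = globalThis) {
  const body = JSON.stringify(payload);
  const canBeacon =
    env.navigator && typeof env.navigator.sendBeacon === 'function';

  if (canBeacon && body.length < MAX_BEACON_CHARS) {
    // sendBeacon cannot set headers, so the key travels in the query string.
    const beaconUrl = `${url}?pk=${encodeURIComponent(publicKey)}`;
    if (env.navigator.sendBeacon(beaconUrl, body)) return 'beacon';
  }

  // Fallback: keepalive lets the request outlive page unload.
  env.fetch(url, {
    method: 'POST',
    keepalive: true,
    headers: {
      'Content-Type': 'application/json',
      'X-SkySignal-Public-Key': publicKey,
    },
    body,
  }).catch(() => {}); // fire-and-forget: failures are ignored here
  return 'fetch';
}
```

The return value exists only to make the chosen path observable in this sketch.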
Batching & Infrastructure
- 7 new batch types in SkySignalClient: dnsMetrics, outboundHttp, cpuProfiles, deprecatedApis, publications, environment, vulnerabilities, each with dedicated REST endpoints and payload keys
- Worker thread pool (WorkerPool + compressionWorker) - Optional worker_threads-based compression offloading to prevent gzip work from blocking the host application's event loop. Lazy initialization, auto-restart on crash, and graceful main-thread fallback.
Configuration
- 18 new config fields added to DEFAULT_CONFIG and validateConfig() for all new collectors: collectDnsTimings, dnsTimingsInterval, collectOutboundHttp, outboundHttpInterval, collectCpuProfiles, cpuProfileThreshold, cpuProfileDuration, cpuProfileCooldown, cpuProfileCheckInterval, collectDeprecatedApis, deprecatedApisInterval, collectPublications, publicationsInterval, collectEnvironment, environmentInterval, collectVulnerabilities, vulnerabilitiesInterval
- All new collectors are enabled by default and use staggered startup to avoid CPU spikes at boot
v1.0.14 (Bug Fix)
- Silent production logging - Replaced bare console.log() calls with debug-guarded _log() helpers across all collectors (HTTPCollector, DDPCollector, DDPQueueCollector, LiveQueriesCollector, MongoCollectionStatsCollector, BaseJobMonitor, SteveJobsMonitor, JobCollector). Previously, operational messages like "Batched 1 HTTP requests", "Sent 18 subscription records", and job lifecycle events were unconditionally printed to stdout regardless of the debug setting. All informational logs are now silent by default and only appear when debug: true is set in the agent configuration.
v1.0.13 (Bug Fix)
- Trace context isolation - Replaced shared _currentMethodContext variable with Node.js AsyncLocalStorage to properly isolate method trace contexts across concurrent async operations. Fixes a bug where background job database queries (e.g., jobs_data.findOneAsync()) would leak into unrelated Meteor method traces when both executed concurrently on the same event loop.
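The isolation that AsyncLocalStorage provides here can be demonstrated in a few lines. The sketch below is not MethodTracer code: traceMethod, recordQuery, and the collection names are hypothetical, and timers stand in for database calls.

```javascript
const { AsyncLocalStorage } = require('node:async_hooks');

const methodContext = new AsyncLocalStorage();

// Called by (hypothetical) database instrumentation on every query.
function recordQuery(collection) {
  const ctx = methodContext.getStore();
  if (ctx) ctx.queries.push(collection); // attributed only inside a traced method
}

// Runs a handler with its own trace context.
async function traceMethod(name, handler) {
  const ctx = { name, queries: [] };
  await methodContext.run(ctx, handler);
  return ctx;
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function demo() {
  // A background job started outside any method context.
  const backgroundJob = (async () => {
    await sleep(5);
    recordQuery('jobs_data');
  })();

  // A method running concurrently on the same event loop.
  const trace = await traceMethod('orders.list', async () => {
    await sleep(10);
    recordQuery('orders');
  });

  await backgroundJob;
  return trace;
}
```

With a single shared variable, the job's query would land in whichever trace happened to be current; here the job runs with no store, so the method trace contains only the method's own query.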
v1.0.12 (New Features & Bug Fixes)
- Change Streams support - Live query observer detection now identifies Change Stream drivers (Meteor 3.5+) alongside oplog and polling, with per-observer introspection instead of global heuristic
- Log collection - New LogsCollector captures console.* and Meteor Log.* output with structured metadata, configurable levels, and sampling support. Includes public SkySignalAgent.addLog() API for programmatic log submission
- Silent failure for optional packages - HTTP and Email package instrumentation no longer logs warnings when packages aren't installed; errors are suppressed to debug-only output (fixes #1)
- Client-side error tracking fix - Fixed 400 "Invalid JSON" response when the agent sends batched client errors to /api/v1/errors. The server endpoint now correctly reads the pre-parsed request body and supports both batched { errors: [...] } and single error formats (fixes #2)
v1.0.11 (New Feature)
- Added client IP address collection for enhanced user context in error tracking and performance correlation
v1.0.7 (Bug Fixes)
- Increased default timeout from 3000ms to 15000ms for API requests to handle slow networks
v1.0.4 (Rollback)
- Reverted to Meteor 2.16+ compatibility due to Node.js version issues with older Meteor versions (Only Meteor 3.x supports Node 20+)
v1.0.3 (Bug Fixes)
- Polyfill for AbortSignal.timeout() to support older Node.js versions
v1.0.2 (Bug Fixes)
- Updated Meteor version compatibility to 2.16
v1.0.1 (Bug Fixes)
- Fixed incorrect default endpoint URL
v1.0.0 (Initial Release)
- Complete Method Tracing - Automatic instrumentation with operation-level profiling
- MongoDB Query Analysis - explain() support, N+1 detection, slow query analysis
- this.unblock() Analysis - Optimization recommendations for blocking methods
- DDP Connection Monitoring - Real-time WebSocket tracking with latency metrics
- MongoDB Pool Monitoring - Connection pool health, checkout times, queue tracking
- Live Query Monitoring - Oplog vs polling efficiency tracking
- Background Job Monitoring - Support for msavin:sjobs with extensible adapter system
- HTTP Request Monitoring - Automatic tracking of server HTTP requests
- Collection Stats - MongoDB collection size and index statistics
- App Version Tracking - Auto-detection from package.json with manual override
- Build Hash Tracking - Source map correlation via BUILD_HASH/GIT_SHA env vars
- Performance Safeguards - Memory limits, request timeouts, batch retries
- Real User Monitoring (RUM) - Client-side Core Web Vitals collection (LCP, FID, CLS, TTFB, FCP, TTI)
- PageSpeed-Style Warnings - Automatic performance threshold warnings in console
- SPA Route Tracking - Automatic performance collection on every route change
- Session Management - 30-minute sessions with localStorage persistence
- Browser Context Collection - Automatic device, browser, OS, network information
- User Correlation - Uses Meteor.userId() to correlate with server-side traces
- Fire-and-Forget HTTP - Reliable transmission with keepalive during page unload
- Configurable Sampling - Auto-detects environment (100% dev, 50% prod) or manual configuration
- web-vitals Integration - Uses Google's official Core Web Vitals library
- System metrics monitoring (CPU, memory, load average)
- HTTP client with batching and auto-flush
- Configurable collection intervals
- Basic error handling and retry logic
- Multi-tenant ready architecture