
Competitive Intelligence Tracker

Structured intelligence from unstructured web data

Build Parameters
Stack: Vertex AI + Genkit
Estimated build time: 8–12 hours

1. The Business Problem (Why build this?)

In today's hyper-competitive digital landscape, understanding the intricate moves of competitors is no longer a luxury but an existential necessity. Businesses across all sectors grapple with a persistent, multifaceted challenge: extracting timely, accurate, and actionable intelligence from the vast, unstructured ocean of the public web. The typical competitive intelligence process is fraught with inefficiencies, leading to delayed decision-making, missed opportunities, and a reactive rather than proactive market stance.

Key Pain Points:

  • Manual, Labor-Intensive Data Collection: Analysts spend countless hours manually browsing competitor websites, forums, and press releases. This is not only tedious but highly prone to human error and oversight.
  • Information Overload and Signal-to-Noise Ratio: The sheer volume of web data makes it incredibly difficult to discern critical changes (e.g., a subtle price adjustment, a new feature launch) from irrelevant noise. Identifying the "signal" often requires deep domain expertise and significant time investment.
  • Lack of Structured, Actionable Insights: Raw web data, even when collected, is unstructured. It requires further processing, aggregation, and synthesis to become genuinely actionable. Businesses struggle to transform qualitative observations into quantitative, comparable data points.
  • Slow Reaction Times: By the time a significant competitive move is identified and analyzed, the opportunity to react strategically might have passed. This leads to a loss of competitive edge, market share, and brand perception.
  • Inconsistent Data Quality: Manual processes lead to inconsistent data capture, making historical trend analysis and reliable comparisons challenging.
  • Resource Drain: Dedicating skilled human capital to repetitive data extraction tasks diverts resources from higher-value strategic analysis and innovation.
  • Difficulty in Feature and Pricing Comparison: Establishing a clear, apples-to-apples comparison of competitor features, pricing tiers, and bundles is complex and rarely automated, making strategic product positioning and pricing optimization a continuous struggle.

The "Competitive Intelligence Tracker" directly addresses these pain points by automating the entire intelligence lifecycle, from data ingestion and transformation to analysis and dissemination, ensuring businesses are always equipped with a structured, near real-time understanding of their competitive landscape. It aims to empower strategic teams with the insights needed to anticipate market shifts, refine product roadmaps, and optimize pricing strategies, all while drastically reducing manual effort and increasing data reliability.

2. Solution Overview

The Competitive Intelligence Tracker is an advanced, AI-powered web application designed to systematically monitor, extract, analyze, and disseminate critical competitive information from unstructured web data. Its core value proposition lies in delivering "Structured intelligence from unstructured web data," providing "Guaranteed JSON outputs" for seamless integration and reliable data consumption. The system acts as a vigilant digital analyst, continuously observing competitor websites and generating actionable insights, summarized reports, and immediate alerts.

At a high level, the solution operates through a robust pipeline:

  1. Configuration: Users define competitors (URLs), key areas of interest (e.g., pricing pages, feature lists), and desired extraction schemas.
  2. Automated Data Ingestion: A scheduled, intelligent web scraping engine regularly visits specified competitor web pages.
  3. Change Detection: Advanced algorithms efficiently identify new content or significant alterations on monitored pages, minimizing redundant processing.
  4. AI-Powered Extraction & Structuring: Leveraging state-of-the-art Generative AI (Gemini via Vertex AI), the system processes raw web content (HTML, text) to extract specific data points (prices, features, descriptions) into rigorously defined JSON schemas. This ensures "guaranteed JSON outputs," crucial for downstream analytical consistency.
  5. Intelligence Generation:
    • Pricing Change Alerts: Automatic detection and notification of price adjustments.
    • Feature Comparison Matrix: Aggregation and normalization of competitor features for side-by-side analysis.
    • AI-Generated Summaries: Concise, digestible overviews of detected changes.
  6. Data Persistence & Analytics: All extracted, structured data is stored in a robust relational database (Postgres via Supabase), enabling historical tracking, trend analysis, and rich querying.
  7. Notification & Reporting:
    • Weekly Digest Emails: Comprehensive summaries of competitive movements delivered periodically.
    • Real-time Alerts: Instant notifications for critical changes (e.g., significant pricing shifts, major feature launches).
    • Interactive Dashboard: A user-friendly Next.js frontend for visualizing intelligence, configuring tracking, and exploring data.

This integrated approach transforms disparate web noise into a coherent, strategic advantage, enabling businesses to pivot quickly, seize market opportunities, and maintain a leading edge.

3. Architecture & Tech Stack Justification

The architecture for the Competitive Intelligence Tracker is designed for scalability, resilience, and developer velocity, leveraging best-in-class components from the modern cloud ecosystem and favoring Google Cloud Platform for its AI capabilities.

High-Level Architecture Diagram (Conceptual):

+------------------+     +-----------------------+     +-------------------+
| User (Browser/UI)| --> | Next.js Frontend/APIs | --> | Cloud Run (Scraper|
+------------------+     +-----------------------+     |   Workers)        |
         ^                       ^        |            +-------------------+
         |                       |        |                     |
         | (Realtime)            |        | (API Calls)         | (Scraped HTML)
         |                       |        |                     v
         |                       |        +-----------> +-------------------+
         |                       |                     | Supabase Storage  |
         |                       |                     | (Raw HTML/Assets) |
         |                       |                     +-------------------+
         |                       |                             |
         |                       |                             v
+-----------------------+        |                     +-------------------+
| Resend (Email Service)| <------+---------------------| Supabase Database |
+-----------------------+        |                     | (Postgres: Data,  |
         ^                       |                     |  Auth, RLS)       |
         |                       |                     +-------------------+
         | (Notifications)       |                             ^
         |                       |                             | (Structured Data, Embeddings)
         |                       |                             |
         |                       +-----------------------------+
         |                                                     |
         |                                                     v
+-----------------------+                       +----------------------------------+
| Cloud Scheduler       | --------------------->| Cloud Run (AI Processing       |
| (Triggers Scrapes,    | (Pub/Sub for tasks)   |   & Transformation Services)   |
|  Digests)             |                       |                                  |
+-----------------------+                       |    +-------------------------+   |
                                                |    | Vertex AI (Gemini:      |   |
                                                |    |   Extraction, Summary,  |   |
                                                |    |   Comparison)           |   |
                                                |    |                         |   |
                                                |    | (Vector Search: Embeddings) |
                                                |    +-------------------------+   |
                                                +----------------------------------+

Tech Stack Justification:

  1. Next.js (Frontend & Backend APIs):

    • Justification: Provides a powerful full-stack framework for building the user interface and API layer. React for a rich, interactive frontend experience. Server-Side Rendering (SSR) and Static Site Generation (SSG) capabilities enhance performance, SEO, and user experience for the dashboard. Next.js API routes are ideal for handling backend logic, authenticating users, and orchestrating calls to other microservices. Its file-system-based routing and built-in optimizations accelerate development.
    • Role: User authentication, interactive dashboard for competitor management, data visualization, real-time updates via Supabase Realtime, and the primary API gateway.
  2. Vertex AI (Generative AI Studio - Gemini, Vector Search):

    • Justification: Google's unified ML platform offers unparalleled capabilities for advanced AI tasks.
      • Gemini (Generative AI Studio): Essential for the core intelligence generation. Its multimodal capabilities (though primarily text here), advanced reasoning, and crucial JSON_MODE provide the "guaranteed JSON outputs" feature. This is critical for reliable data extraction, summarization, and comparative analysis from unstructured web content. Fine-tuning options are available if base models aren't sufficient.
      • Vector Search (Matching Engine): Provides highly scalable and efficient similarity search for vector embeddings. This is vital for normalizing extracted features (identifying semantically similar features despite different naming conventions) and potentially for advanced change detection (semantic diffing).
    • Role: AI-powered extraction of structured data, AI-generated summaries, feature normalization, semantic comparison, intelligent change detection.
  3. Supabase (Postgres Database, Auth, Storage, Realtime):

    • Justification: An open-source Firebase alternative providing a robust, scalable Postgres database with built-in features crucial for rapid development and production readiness.
      • Postgres: A highly reliable, ACID-compliant relational database perfect for storing structured competitive intelligence, historical data, user configurations, and audit logs. Its extensibility (e.g., pg_cron for internal scheduling, pg_net for webhooks) is a bonus.
      • Auth: Simplifies user management, authentication (email/password, OAuth), and authorization with Row Level Security (RLS), crucial for multi-tenancy or user-specific data access.
      • Storage: Securely stores raw scraped HTML, screenshots, or any associated media, providing a durable and scalable object storage solution linked to the database.
      • Realtime: Enables instant updates to the Next.js frontend when new data (e.g., a pricing alert) is inserted into the database, enhancing the user experience.
    • Role: Primary data store, user management, secure storage for raw scraped data, real-time UI updates.
  4. Resend (Transactional Email API):

    • Justification: A developer-friendly, high-deliverability email API designed specifically for transactional emails. For critical alerts and weekly digests, ensuring emails reach the inbox reliably and promptly is paramount. Its simple API and focus on performance make it ideal for notifications.
    • Role: Sending pricing change alerts, weekly digest emails, and potentially user-related communications (e.g., password resets).

Supporting Infrastructure (Google Cloud):

  • Cloud Run: Serverless compute platform for containerized applications. Ideal for stateless, event-driven microservices like scraper workers, AI processing services, and API endpoints not handled by Next.js API routes. Scales automatically from zero, paying only for execution time.
  • Cloud Scheduler: Managed cron job service. Triggers scheduled tasks such as initiating web scrapes, generating weekly digests, and cleaning up old data.
  • Cloud Pub/Sub: Asynchronous messaging service. Decouples the scraping process from AI processing and notification, providing robust, scalable, and resilient task queues.
  • Cloud Monitoring / Logging / Error Reporting: Essential for observability, debugging, and maintaining the health of the application in production.

This combination creates a powerful, flexible, and cost-effective architecture that scales from initial prototype to enterprise-grade competitive intelligence solution.

4. Core Feature Implementation Guide

This section details the implementation strategies for the most critical features, emphasizing the pipelines and AI interactions.

4.1. Data Ingestion Pipeline (Scraping & Change Detection)

The ingestion pipeline is the bedrock, responsible for reliably collecting raw web data and identifying meaningful changes.

Pipeline Flow:

  1. Scheduled Trigger: Cloud Scheduler (e.g., daily, hourly) sends a Pub/Sub message to a topic scraper-start.
  2. Scraper Worker Activation: A Cloud Run service subscribes to scraper-start. For each configured competitor URL, it spins up a new scraping task.
  3. Web Scraping: Each scraper task (a Cloud Run container) uses a headless browser (Puppeteer/Playwright) for dynamic content or Axios/Cheerio for static content. It visits the target URL.
    • Robustness: Implement retry logic, user-agent rotation, proxy integration (if necessary), and CAPTCHA detection/handling strategies.
    • Content Capture: Stores the raw HTML content (and optionally a screenshot) in Supabase Storage.
  4. Change Detection:
    • Hashing: Generate a SHA256 hash of the cleaned (e.g., whitespace normalized, script/style tags removed) HTML content.
    • Comparison: Retrieve the last known hash for that URL from Supabase Database.
    • DOM Diffing (Advanced): For more granular changes, a library like diff-dom or a custom heuristic can compare the current DOM tree with the previous one, identifying specific element changes.
    • Trigger AI: If a significant change is detected (hash mismatch, or relevant DOM diff), a Pub/Sub message is sent to ai-processing-trigger with the raw_html_storage_path and competitor_id.
  5. Storage: The new HTML content and its hash are stored in Supabase Storage and the database respectively, linking to competitor_id and timestamp.

Pseudo-code for Change Detection:

// Pseudo-code using the Supabase JS client and @google-cloud/pubsub
interface PageSnapshot {
  id: string;
  competitor_id: string;
  url: string;
  timestamp: Date;
  html_storage_path: string;
  html_hash: string;
}

async function processScrapedPage(competitorId: string, url: string, rawHtml: string): Promise<void> {
  const cleanedHtml = cleanHtmlForHashing(rawHtml); // Remove scripts, styles, normalize whitespace
  const currentHash = generateSha256(cleanedHtml);

  // Retrieve last snapshot from Supabase; maybeSingle() returns null
  // instead of erroring when no previous snapshot exists
  const { data: lastSnapshot } = await supabase.from('page_snapshots')
    .select('*')
    .eq('competitor_id', competitorId)
    .eq('url', url)
    .order('timestamp', { ascending: false })
    .limit(1)
    .maybeSingle();

  if (lastSnapshot && lastSnapshot.html_hash === currentHash) {
    console.log(`No change detected for ${url}.`);
    return; // No change, exit
  }

  // Store new HTML to Supabase Storage
  const storagePath = `html_snapshots/${competitorId}/${Date.now()}.html`;
  await supabase.storage.from('raw_html').upload(storagePath, rawHtml, { contentType: 'text/html' });

  // Insert new snapshot and read back the generated id
  const { data: newSnapshot } = await supabase.from('page_snapshots')
    .insert({
      competitor_id: competitorId,
      url,
      timestamp: new Date().toISOString(),
      html_storage_path: storagePath,
      html_hash: currentHash,
    })
    .select('id')
    .single();

  // Trigger AI processing for the detected change
  await pubsub.topic('ai-processing-trigger').publishMessage({
    json: {
      competitorId,
      url,
      newSnapshotId: newSnapshot?.id,
      previousSnapshotId: lastSnapshot?.id, // Pass previous for diffing
    },
  });

  console.log(`Change detected and AI processing triggered for ${url}`);
}
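The helpers cleanHtmlForHashing and generateSha256 referenced above are left undefined in the pseudo-code. A minimal Node.js sketch follows; the regex-based cleaning is an assumption of this sketch, and a real deployment might prefer an HTML parser such as cheerio for more precise boilerplate removal:

```typescript
import { createHash } from 'node:crypto';

// Strip volatile markup so cosmetic changes (script/style updates, whitespace
// reflow, comments) do not register as content changes.
function cleanHtmlForHashing(rawHtml: string): string {
  return rawHtml
    .replace(/<script[\s\S]*?<\/script>/gi, '')
    .replace(/<style[\s\S]*?<\/style>/gi, '')
    .replace(/<!--[\s\S]*?-->/g, '')
    .replace(/\s+/g, ' ')
    .trim();
}

function generateSha256(content: string): string {
  return createHash('sha256').update(content, 'utf8').digest('hex');
}
```

With this cleaning in place, two snapshots that differ only in injected analytics scripts or whitespace hash identically and skip the AI pipeline entirely.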

4.2. Guaranteed JSON Outputs (via Gemini & Vertex AI)

This is a cornerstone feature. Gemini's JSON_MODE and function calling are leveraged to enforce strict JSON schema adherence.

Pipeline Flow:

  1. AI Processing Trigger: Pub/Sub message ai-processing-trigger activates a Cloud Run AI processing service.
  2. HTML Retrieval: Service fetches raw_html_content from Supabase Storage using the provided html_storage_path.
  3. Prompt Construction: The service constructs a detailed prompt for Gemini, incorporating:
    • System Instruction: Defines Gemini's role (e.g., "You are an expert market analyst extracting competitive data.").
    • User Instruction: Specifies the task (e.g., "Extract all pricing plans and their features from the provided HTML content.").
    • Context: The cleaned text content of the HTML.
    • JSON Schema: The explicit schema defining the expected output structure.
  4. Gemini API Call: Call Vertex AI Gemini with the prompt and specify JSON_MODE.
    • Function Calling (Alternative/Enhancement): For highly structured or multi-step extractions, define specific tools (functions) Gemini can call, each returning structured data.
  5. Schema Validation: Post-generation, the received JSON is validated against the defined schema using a library like Zod or Pydantic.
  6. Error Handling & Retries: If validation fails, log the error, potentially retry with a refined prompt, or flag for human review.
  7. Data Storage: Store the validated structured JSON data (e.g., extracted_pricing_data, extracted_features) in Supabase.

Pseudo-code for Gemini Extraction:

// Note: the @google/generative-ai package targets the Gemini API directly; on
// Vertex AI, the @google-cloud/vertexai SDK exposes an equivalent
// getGenerativeModel interface authenticated via Application Default Credentials.
import { GoogleGenerativeAI } from '@google/generative-ai';
import { z } from 'zod'; // For schema validation
import { zodToJsonSchema } from 'zod-to-json-schema'; // Serialize Zod schemas for the prompt

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
// JSON mode (responseMimeType: 'application/json') requires a model that supports it, e.g. Gemini 1.5 Pro
const model = genAI.getGenerativeModel({ model: 'gemini-1.5-pro', generationConfig: { responseMimeType: 'application/json' } });

// Define Zod schemas for guaranteed output
const pricingSchema = z.array(z.object({
  planName: z.string().describe("Name of the pricing plan, e.g., 'Starter', 'Pro'"),
  price: z.union([z.string(), z.number()]).describe("Price of the plan, e.g., '$29', 'Free', 19.99"),
  currency: z.string().optional().describe("Currency symbol or code, e.g., '$', 'USD'"),
  billingCycle: z.string().optional().describe("Billing cycle, e.g., 'per month', 'annually'"),
  features: z.array(z.string()).optional().describe("List of features included in this plan"),
  url: z.string().url().optional().describe("Direct URL to the pricing plan if available")
}));

const featureSchema = z.array(z.object({
  featureName: z.string().describe("Name of the feature, e.g., 'Unlimited Users', 'API Access'"),
  description: z.string().optional().describe("Short description of the feature"),
  availability: z.array(z.string()).optional().describe("Which plans this feature is available in"),
}));

async function extractStructuredData(htmlContent: string, competitorId: string, snapshotId: string): Promise<void> {
  const prompt = `
    You are an expert market analyst.
    Given the following HTML content from a competitor's website (Competitor ID: ${competitorId}, Snapshot ID: ${snapshotId}),
    extract all distinct pricing plans, their prices, billing cycles, currencies, and associated features.
    Additionally, extract all key features mentioned on the page, regardless of pricing plan, along with their descriptions and which plans they are available in.

    Provide the output as a single JSON object with two top-level keys: "pricing" and "features".
    Each array element for "pricing" should strictly adhere to the following JSON schema:
    ${JSON.stringify(zodToJsonSchema(pricingSchema.element), null, 2)}

    Each array element for "features" should strictly adhere to the following JSON schema:
    ${JSON.stringify(zodToJsonSchema(featureSchema.element), null, 2)}

    If any information is not explicitly found, use 'null' or omit optional fields. Do not hallucinate.
    HTML Content:
    ${htmlContent}
  `;

  try {
    const result = await model.generateContent(prompt);
    const textResponse = result.response.text();
    const parsedJson = JSON.parse(textResponse);

    // Validate against Zod schemas
    const validatedPricing = pricingSchema.parse(parsedJson.pricing);
    const validatedFeatures = featureSchema.parse(parsedJson.features);

    // Store validated data in Supabase
    await supabase.from('extracted_pricing').insert({
      snapshot_id: snapshotId,
      data: validatedPricing,
    });
    await supabase.from('extracted_features').insert({
      snapshot_id: snapshotId,
      data: validatedFeatures,
    });

    console.log(`Successfully extracted and validated data for snapshot ${snapshotId}`);

  } catch (error) {
    console.error(`Error during Gemini extraction or validation for snapshot ${snapshotId}:`, error);
    // Implement retry logic or alert for manual review
  }
}

4.3. Pricing Change Alerts

Leverages the structured pricing data.

Logic:

  1. Detection Trigger: After extractStructuredData successfully stores new pricing data, a new event pricing-extracted is published to Pub/Sub.
  2. Alert Service: A Cloud Run service subscribes to pricing-extracted.
  3. Historical Comparison: Retrieve the previous extracted_pricing record for the same competitor.
  4. Delta Calculation: Compare the new pricing JSON with the old. Identify:
    • New plans added.
    • Plans removed.
    • Price changes (numerical difference, percentage change).
    • Feature changes within a plan.
    • Libraries such as deep-diff (or custom comparison logic) can compute these deltas.
  5. Thresholding: Configure user-defined thresholds (e.g., "alert if price changes by >5%").
  6. Notification: If a significant change is detected, compose an email using Resend, detailing the change. Store the alert in Supabase (e.g., alerts table).

Pseudo-code for Pricing Comparison:

import { Resend } from 'resend';
const resend = new Resend(process.env.RESEND_API_KEY);

async function checkPricingChanges(newSnapshotId: string, previousSnapshotId: string | null, competitorId: string): Promise<void> {
  // select('data') returns a row shaped { data: [...] }, so unwrap the inner column
  const { data: newRow } = await supabase.from('extracted_pricing').select('data').eq('snapshot_id', newSnapshotId).maybeSingle();
  const newPricing = newRow?.data;
  if (!newPricing) return;

  if (!previousSnapshotId) {
    console.log(`First pricing extraction for ${competitorId}. No previous data to compare.`);
    // Optionally send an "initial extraction" alert
    return;
  }

  const { data: oldRow } = await supabase.from('extracted_pricing').select('data').eq('snapshot_id', previousSnapshotId).maybeSingle();
  const oldPricing = oldRow?.data;
  if (!oldPricing) {
    console.warn(`Could not find previous pricing data for ${competitorId}/${previousSnapshotId}`);
    return;
  }

  let changes: string[] = [];

  // Implement detailed comparison logic here.
  // Example: Find changed prices
  newPricing.forEach((newPlan: any) => {
    const oldPlan = oldPricing.find((op: any) => op.planName === newPlan.planName);
    if (oldPlan) {
      if (newPlan.price !== oldPlan.price || newPlan.billingCycle !== oldPlan.billingCycle) {
        changes.push(`${newPlan.planName}: Price changed from ${oldPlan.price} ${oldPlan.billingCycle || ''} to ${newPlan.price} ${newPlan.billingCycle || ''}.`);
      }
      // Add logic for feature changes within a plan
    } else {
      changes.push(`New plan detected: ${newPlan.planName} at ${newPlan.price}.`);
    }
  });

  oldPricing.forEach((oldPlan: any) => {
    if (!newPricing.some((np: any) => np.planName === oldPlan.planName)) {
      changes.push(`Plan removed: ${oldPlan.planName}.`);
    }
  });

  if (changes.length > 0) {
    await supabase.from('alerts').insert({
      competitor_id: competitorId,
      alert_type: 'pricing_change',
      summary: `Pricing changes detected for ${competitorId}: ${changes.join(' ')}`,
      details: { changes },
      timestamp: new Date(),
    });

    // Send email via Resend
    await resend.emails.send({
      from: 'alerts@yourdomain.com',
      to: 'user@example.com', // Dynamically fetch user emails
      subject: `[Competitive Intelligence] Pricing Alert: ${competitorId}`,
      html: `<strong>Pricing Changes:</strong><ul>${changes.map(c => `<li>${c}</li>`).join('')}</ul>`,
    });
  }
}

4.4. Feature Comparison Matrix

This requires normalizing features to enable meaningful comparisons.

Logic:

  1. Extraction: As part of the extractStructuredData process, extracted_features are stored.
  2. Embedding Generation: For each extracted featureName and description, generate a vector embedding using Vertex AI Embedding API. Store these embeddings alongside the feature in Supabase (e.g., feature_embeddings table).
  3. Similarity Search (Normalization):
    • When displaying the matrix, or periodically, compare new feature embeddings against existing "canonical" features (user-defined or auto-clustered).
    • Use Vertex AI Vector Search (Matching Engine) to find the most similar existing feature.
    • Allow users to manually map similar features or confirm auto-suggested mappings. This creates a normalized_feature_id.
  4. Matrix Generation (Frontend): The Next.js frontend queries extracted_features and normalized_features. It groups features by their normalized_feature_id and then displays availability across different competitors and their plans.

Schema (Supabase):

  • competitors table
  • extracted_features table (link to snapshot_id, competitor_id, feature_name, description, availability, embedding_vector)
  • normalized_features table (id, canonical_name, user_defined, embedding_vector)
  • feature_mappings table (extracted_feature_id, normalized_feature_id)
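To make the normalization step concrete, here is a minimal sketch of threshold-based matching against canonical feature embeddings. It assumes embeddings have already been generated (e.g. via the Vertex AI Embedding API) and performs a brute-force cosine comparison; at scale, Vertex AI Vector Search replaces this loop with approximate nearest-neighbor lookup. The CanonicalFeature shape and the 0.85 threshold are illustrative:

```typescript
interface CanonicalFeature {
  id: string;
  canonicalName: string;
  embedding: number[];
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Map a newly extracted feature onto an existing canonical feature when the
// similarity clears the threshold; otherwise flag it as a candidate for a new
// canonical entry (or for manual mapping by the user).
function matchFeature(
  newEmbedding: number[],
  canonical: CanonicalFeature[],
  threshold = 0.85,
): { match: CanonicalFeature | null; score: number } {
  let best: CanonicalFeature | null = null;
  let bestScore = -1;
  for (const candidate of canonical) {
    const score = cosineSimilarity(newEmbedding, candidate.embedding);
    if (score > bestScore) { best = candidate; bestScore = score; }
  }
  return bestScore >= threshold ? { match: best, score: bestScore } : { match: null, score: bestScore };
}
```

Matches above the threshold populate feature_mappings automatically; below-threshold candidates surface in the dashboard for manual confirmation.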

4.5. AI-Generated Summaries

Concise, human-readable summaries of detected changes.

Logic:

  1. Trigger: After any significant change (pricing, new features, content updates) is processed and stored.
  2. Context Assembly: Gather relevant data:
    • Raw HTML diff (if available).
    • Structured pricing changes.
    • Structured new/removed features.
    • Previous and current page snapshots.
  3. Gemini Prompt:
    "You are an expert market analyst summarizing key competitive intelligence.
    Given the following detected changes for Competitor [Competitor Name] on [URL] (from [Old Snapshot Date] to [New Snapshot Date]):
    
    [Detailed list of extracted pricing changes]
    [Detailed list of new/removed features]
    [Relevant snippets from HTML diff, if applicable]
    
    Generate a concise summary in 3-5 bullet points, highlighting the most impactful changes, especially new features, pricing adjustments, or major strategic shifts. Focus on actionable insights.
    "
    
  4. Generation: Call Vertex AI Gemini.
  5. Storage: Store the generated summary in Supabase, linked to the alert or snapshot_id. Display in the UI and include in weekly digests.
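The context-assembly and prompt steps above can be sketched as a small helper; buildSummaryPrompt and the ChangeContext shape are hypothetical names for this sketch, and empty sections are dropped so the model never sees placeholder headings:

```typescript
interface ChangeContext {
  competitorName: string;
  url: string;
  oldSnapshotDate: string;
  newSnapshotDate: string;
  pricingChanges: string[];
  featureChanges: string[];
  htmlDiffSnippets?: string[];
}

// Assemble the summarization prompt from the structured change context,
// omitting any section that has no data.
function buildSummaryPrompt(ctx: ChangeContext): string {
  const sections = [
    'You are an expert market analyst summarizing key competitive intelligence.',
    `Given the following detected changes for Competitor ${ctx.competitorName} on ${ctx.url} (from ${ctx.oldSnapshotDate} to ${ctx.newSnapshotDate}):`,
    ctx.pricingChanges.length ? `Pricing changes:\n- ${ctx.pricingChanges.join('\n- ')}` : '',
    ctx.featureChanges.length ? `Feature changes:\n- ${ctx.featureChanges.join('\n- ')}` : '',
    ctx.htmlDiffSnippets?.length ? `Relevant HTML diff snippets:\n${ctx.htmlDiffSnippets.join('\n')}` : '',
    'Generate a concise summary in 3-5 bullet points, highlighting the most impactful changes, especially new features, pricing adjustments, or major strategic shifts. Focus on actionable insights.',
  ];
  return sections.filter(Boolean).join('\n\n');
}
```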

4.6. Weekly Digest Emails

Aggregated report of all competitive activity over the past week.

Logic:

  1. Scheduled Trigger: Cloud Scheduler triggers a Cloud Run service weekly (e.g., Monday morning).
  2. Data Aggregation: The service queries Supabase for:
    • All new alerts (pricing changes, major feature launches) from the last 7 days.
    • All ai_generated_summaries from the last 7 days.
    • New competitors added or pages monitored.
  3. Email Content Generation: Construct a dynamic HTML email template. Iterate through the aggregated data, presenting it clearly:
    • "Top 3 Changes This Week" (based on severity/impact score).
    • Per-competitor breakdown with bulleted summaries.
    • Links back to the dashboard for detailed analysis.
  4. Email Sending: Use Resend to send the personalized digest to all subscribed users.
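A minimal sketch of the digest body generation, grouping alerts per competitor; DigestAlert and buildWeeklyDigestHtml are illustrative names, and a production system would likely swap this string assembly for a proper email template library:

```typescript
interface DigestAlert {
  competitorName: string;
  alertType: string;
  summary: string;
}

// Group the week's alerts by competitor and render a simple HTML body with a
// link back to the dashboard for detailed analysis.
function buildWeeklyDigestHtml(alerts: DigestAlert[], dashboardUrl: string): string {
  const byCompetitor = new Map<string, DigestAlert[]>();
  for (const alert of alerts) {
    const list = byCompetitor.get(alert.competitorName) ?? [];
    list.push(alert);
    byCompetitor.set(alert.competitorName, list);
  }
  const sections = [...byCompetitor.entries()].map(([name, items]) =>
    `<h2>${name}</h2><ul>${items.map(a => `<li><strong>${a.alertType}:</strong> ${a.summary}</li>`).join('')}</ul>`
  );
  return `<h1>Weekly Competitive Digest</h1>${sections.join('')}<p><a href="${dashboardUrl}">Open the dashboard</a></p>`;
}
```

The resulting HTML string is what gets handed to Resend's send call as the `html` field, one personalized message per subscribed user.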

5. Gemini Prompting Strategy

Effective prompting is paramount for achieving the "Guaranteed JSON Outputs" and high-quality summaries. Our strategy focuses on clarity, explicit instruction, and leveraging Gemini's structural capabilities.

Core Principles:

  1. Role Definition: Always define Gemini's persona. This sets the context and tone for its responses.
    • Example: "You are an expert market analyst specialized in competitive intelligence. Your task is to meticulously extract specific data points from web content."
  2. Clear Task Specification: State precisely what needs to be done.
    • Example: "Identify all pricing plans, their associated features, prices, and billing frequencies. Additionally, list all distinct product features mentioned on the page, irrespective of plan, along with a brief description and their availability across plans."
  3. Contextualization: Provide all necessary information, including source URL, competitor name, and the full raw text content. Filtering out irrelevant HTML (scripts, styles) beforehand can improve focus.
    • Example: "Analyze the following HTML content from [Competitor Name]'s pricing page ([URL]): \nhtml\n[HTML Content]\n"
  4. Strict Output Format Specification (JSON_MODE): This is critical.
    • Declare JSON_MODE: Ensure the API call explicitly requests JSON. Vertex AI's responseMimeType: 'application/json' in generationConfig is key.
    • Provide JSON Schema: Embed a clear, well-defined JSON schema directly in the prompt. This acts as a contract.
      • Use description fields within the schema to guide Gemini on the expected content for each field.
      • Specify type for all fields (string, number, boolean, array, object).
      • Define optionality.
    • Example Snippet for Pricing Schema in Prompt:
      {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "planName": { "type": "string", "description": "The name of the pricing plan (e.g., 'Basic', 'Premium')." },
            "monthlyPrice": { "type": ["number", "null"], "description": "The numeric monthly price if available, otherwise null." },
            "yearlyPrice": { "type": ["number", "null"], "description": "The numeric annual price if available, otherwise null." },
            "currency": { "type": "string", "description": "The currency symbol or code (e.g., '$', 'USD')." },
            "featuresIncluded": { "type": "array", "items": { "type": "string" }, "description": "A list of key features included in this plan." }
          },
          "required": ["planName", "currency"]
        }
      }
      
  5. Constraint Definition: Explicitly state what not to do, or what to do if information is missing.
    • Examples:
      • "Do not invent information. If a piece of data is not explicitly present, return null for that field or omit optional fields."
      • "Only extract pricing tiers and features visible to regular users, ignore any 'enterprise contact us' forms unless specified."
      • "Prioritize the most current pricing structure if multiple are present."
  6. Few-Shot Examples (Conditional): For highly nuanced extractions or when dealing with diverse webpage layouts, providing 1-2 examples of input text and their desired JSON output can significantly improve accuracy. This is particularly useful if the base model struggles with edge cases.
    • Caveat: Increases token usage and prompt length, so use judiciously.
  7. Iterative Refinement:
    • Test with diverse data: Don't just test with one competitor.
    • Analyze failures: If Gemini provides incorrect JSON or misses data, examine the specific input and prompt.
    • Adjust Prompt:
      • Add more specific instructions.
      • Refine schema descriptions.
      • Add negative constraints (e.g., "Exclude X if present").
      • Consider preprocessing the input HTML (e.g., removing boilerplate, ads) to reduce noise.
    • Version Control Prompts: Treat prompts as code; store them, version them, and track their performance.
  8. Function Calling (Advanced): For complex workflows, define distinct "tools" (functions) Gemini can invoke. For instance, extract_pricing(html_content) and extract_features(html_content). Gemini can decide which tool to call based on the overall instruction, and each tool can have its own rigid JSON schema. This modularizes the extraction process.
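Assuming extraction runs against the Vertex AI generateContent endpoint, the schema (strategy 4) and constraint (strategy 5) techniques can be sketched as a small request builder. The payload shape mirrors the generateContent REST body (responseMimeType/responseSchema are real Gemini controlled-generation options), but the helper names and trimmed schema here are illustrative, not a pinned SDK signature:

```typescript
// Sketch: assemble a Gemini extraction request that pins the output to a
// JSON schema (strategy 4) and embeds explicit constraints (strategy 5).
interface ExtractionRequest {
  contents: { role: string; parts: { text: string }[] }[];
  generationConfig: {
    responseMimeType: string;
    responseSchema: object;
    temperature: number;
  };
}

// Trimmed version of the pricing schema shown above.
const PRICING_SCHEMA = {
  type: "array",
  items: {
    type: "object",
    properties: {
      planName: { type: "string" },
      monthlyPrice: { type: ["number", "null"] },
      currency: { type: "string" },
    },
    required: ["planName", "currency"],
  },
};

const CONSTRAINTS = [
  "Do not invent information; return null for missing fields.",
  "Only extract tiers visible to regular users.",
  "Prioritize the most current pricing structure if multiple are present.",
];

function buildExtractionRequest(pageText: string): ExtractionRequest {
  const instruction = [
    "Extract all pricing plans from the page text below.",
    ...CONSTRAINTS,
    "PAGE TEXT:",
    pageText,
  ].join("\n");

  return {
    contents: [{ role: "user", parts: [{ text: instruction }] }],
    generationConfig: {
      responseMimeType: "application/json", // forces JSON-only output
      responseSchema: PRICING_SCHEMA,       // constrains the JSON shape
      temperature: 0,                        // deterministic extraction
    },
  };
}
```

Because the prompt text is assembled from versioned constants rather than written inline, strategy 7's "treat prompts as code" guidance falls out naturally: the constraints and schema live in source control and can be diffed, tested, and rolled back.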

By diligently applying these strategies, we ensure that the AI component consistently delivers high-quality, structured intelligence, which is the cornerstone of the Competitive Intelligence Tracker.

6. Deployment & Scaling

The deployment and scaling strategy focuses on leveraging serverless technologies for cost-efficiency and automatic scaling, while ensuring robustness and observability.

6.1. Local Development

  • Frontend: Standard npm run dev for the Next.js application. Hot reloading ensures a smooth development experience.
  • Backend (APIs/Workers): Cloud Run services are plain containers, so they can be run locally as Node.js processes or via docker run; the Pub/Sub emulator (gcloud beta emulators pubsub start) can stand in for the production message broker.
  • Database: Supabase offers a local development setup via the Supabase CLI (supabase start), which runs a local Postgres instance, Auth, and Storage in Docker. This mirrors the production database schema.
  • AI: Use mocked responses for Gemini during initial development or directly call the Vertex AI API with development keys.
  • Resend: Use a development email service like Mailtrap or Ethereal for local email testing.
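The mocked-Gemini approach above can be sketched as an environment-gated wrapper around the extraction call. The MOCK_GEMINI variable and the fixture shape are assumptions for illustration; the real branch would call Vertex AI with development credentials:

```typescript
// Sketch: swap the real Vertex AI call for a canned response in local dev.
type PricingPlan = { planName: string; monthlyPrice: number | null; currency: string };

// Deterministic fixture: no quota usage, no latency, no cost.
const MOCK_PRICING: PricingPlan[] = [
  { planName: "Basic", monthlyPrice: 10, currency: "USD" },
  { planName: "Premium", monthlyPrice: 49, currency: "USD" },
];

async function extractPricing(pageText: string): Promise<PricingPlan[]> {
  if (process.env.MOCK_GEMINI === "1") {
    return MOCK_PRICING;
  }
  // Real path: call Vertex AI here (omitted; requires credentials).
  throw new Error("Vertex AI call not configured in this sketch");
}
```

Gating on an environment variable rather than a build flag means the same container image behaves identically in CI, local dev, and production, with only configuration changing.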

6.2. CI/CD Pipeline

  • Version Control: GitHub or Google Cloud Source Repositories.
  • Automated Builds: GitHub Actions or Cloud Build.
    • Linting & Formatting: Ensure code quality and consistency (ESLint, Prettier).
    • Unit & Integration Tests: Run tests for Next.js components, API routes, and core business logic (e.g., data comparison functions).
    • Docker Builds: Build Docker images for Cloud Run services (scrapers, AI processors).
    • Deployment:
      • Next.js Frontend: Deploy to Vercel, Cloudflare Pages, or Google Cloud Run (as a static-site/SSR container).
      • Cloud Run Services: Deploy Docker images to Cloud Run.
      • Supabase Database Migrations: Apply schema migrations in CI/CD using the Supabase CLI (e.g., supabase db push or supabase migration up).

6.3. Database Scaling (Supabase Postgres)

  • Vertical Scaling: Supabase offers various compute add-ons for increasing CPU and RAM as load grows.
  • Read Replicas: For read-heavy workloads (e.g., dashboard queries, digest generation), provision read replicas to offload the primary database.
  • Connection Pooling: Supabase provides a built-in connection pooler (Supavisor, previously PgBouncer), which efficiently manages database connections, preventing connection saturation under high load.
  • Row Level Security (RLS): Crucial for multi-tenant applications, RLS ensures users only access data they are authorized for, handled efficiently at the database level.
  • Indexing: Proactively create appropriate indexes on frequently queried columns (e.g., competitor_id, timestamp, url) to optimize query performance.

6.4. AI Scaling (Vertex AI)

  • Automatic Scaling: Vertex AI services (Gemini, Vector Search, Embeddings) inherently scale based on demand. There's no manual scaling configuration required for the AI models themselves.
  • Quota Management: Monitor and request quota increases for Vertex AI API calls as user base and monitoring scope grow.
  • Batch Processing: For large volumes of text, optimize API calls by batching multiple documents for embedding or summarization where possible, reducing overhead and potentially cost.
  • Caching: Cache AI responses for identical inputs (e.g., summaries of static content) to reduce redundant calls and costs.
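The caching point above can be sketched as memoization keyed on a hash of the input, so byte-identical page content never triggers a second (billed) model call. The in-memory Map is an assumption for illustration; in production a shared store such as a Postgres table or Redis would back the cache:

```typescript
import { createHash } from "node:crypto";

// Sketch: memoize AI calls on a SHA-256 of the input text.
const cache = new Map<string, string>();

function contentKey(text: string): string {
  return createHash("sha256").update(text).digest("hex");
}

async function summarizeCached(
  text: string,
  callModel: (t: string) => Promise<string>, // the real Gemini call, injected
): Promise<string> {
  const key = contentKey(text);
  const hit = cache.get(key);
  if (hit !== undefined) return hit; // cache hit: skip the model entirely
  const summary = await callModel(text);
  cache.set(key, summary);
  return summary;
}
```

Hashing the content (rather than the URL) means a page that is re-scraped but unchanged costs nothing, while any real change produces a new key and a fresh model call.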

6.5. Data Ingestion Scaling (Scrapers)

  • Horizontal Scaling with Cloud Run: Scraper services are deployed as Cloud Run containers. Triggering them via Pub/Sub allows for massive horizontal scaling. Each Pub/Sub message can represent a single scraping job, leading to many concurrent scraper instances.
  • Distributed Task Queue (Pub/Sub): Pub/Sub acts as a robust message broker, distributing scraping tasks across multiple Cloud Run instances. It provides retry mechanisms and guarantees message delivery.
  • Rate Limiting & Concurrency: Implement rate limiting in scraper services to avoid overwhelming target websites. Configure Cloud Run concurrency for scraper containers appropriately.
  • Proxy Rotation (Advanced): For aggressive scraping or to mitigate IP blocking, integrate a third-party proxy service with IP rotation capabilities.
  • Smart Scraping: Only re-scrape pages that are likely to have changed. Use change detection early in the pipeline to avoid unnecessary AI processing.
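The rate-limiting point above can be sketched as a per-domain token bucket inside each scraper. Injecting the clock keeps the limiter deterministic and testable; the capacity and refill numbers are assumptions, and in production multiple Cloud Run instances would either share state or rely on Pub/Sub delivery pacing instead:

```typescript
// Sketch: token bucket keeping scrapers under a polite per-domain rate.
class TokenBucket {
  private tokens: number;
  private last: number;

  constructor(
    private capacity: number,              // burst size
    private refillPerSec: number,          // sustained requests/sec
    private now: () => number = () => Date.now(),
  ) {
    this.tokens = capacity;
    this.last = this.now();
  }

  tryAcquire(): boolean {
    const t = this.now();
    const elapsedSec = (t - this.last) / 1000;
    this.last = t;
    // Refill proportionally to elapsed time, capped at the burst size.
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillPerSec);
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false; // caller should back off or re-queue the job
  }
}
```

A scraper that gets false back would typically nack the Pub/Sub message (or re-publish with a delay) rather than block, so the container stays free to process jobs for other domains.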

6.6. Monitoring & Observability

  • Google Cloud Monitoring & Logging: Centralized logging for all Cloud Run services, Pub/Sub, and Cloud Scheduler. Set up dashboards and alerts for:
    • Scraping success rates (e.g., 200 OK vs. 4xx/5xx errors).
    • AI processing latency and error rates.
    • Database connection utilization and query performance.
    • Alert delivery success/failure.
    • Cloud Run instance count and CPU/memory usage.
  • Error Reporting: Automatically capture and group application errors from Cloud Run services.
  • Custom Metrics: Instrument code to emit custom metrics for business-specific KPIs, such as "number of new pricing changes detected per day."
  • Uptime Monitoring: External checks for the Next.js application and critical API endpoints.

6.7. Security

  • End-to-End Encryption: HTTPS for all client-server communication. Data at rest encryption provided by Supabase and Google Cloud Storage.
  • Authentication & Authorization: Supabase Auth for user management. Implement Row Level Security (RLS) on Supabase tables to enforce data isolation between users/tenants.
  • API Key Management: Store sensitive API keys (Vertex AI, Resend) securely in Google Cloud Secret Manager. Access them securely from Cloud Run services via environment variables.
  • Input Sanitization: Validate and sanitize all user inputs to prevent SQL injection, XSS, and other vulnerabilities.
  • Least Privilege: Configure IAM roles for Cloud Run services and other GCP resources with the minimum necessary permissions.
  • Vulnerability Scanning: Regularly scan dependencies and Docker images for known vulnerabilities.

6.8. Cost Optimization

  • Serverless First: Cloud Run, Pub/Sub, Cloud Scheduler, and Vertex AI are "pay-per-use" services, automatically scaling down to zero when not in use, significantly reducing idle costs.
  • Efficient AI Calls: Optimize Gemini prompts to be concise, minimize token usage, and batch requests where possible. Cache AI responses for repeated queries.
  • Smart Scraping: Prioritize scraping based on perceived change likelihood or user-defined importance. Avoid redundant scraping of unchanged content.
  • Database Sizing: Start with a smaller Supabase instance and scale up as needed, monitoring resource utilization.
  • Storage Lifecycles: Implement lifecycle rules for Supabase Storage to automatically delete or archive old raw HTML snapshots after a defined period (e.g., 3-6 months), reducing storage costs.
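The storage-lifecycle point above can be sketched as a pure helper that selects snapshot keys older than a retention window. The object shape and the 90-day default are assumptions; in production a scheduled job would feed the result to supabase.storage.from(...).remove(keys):

```typescript
// Sketch: pick raw-HTML snapshot keys older than the retention window.
interface SnapshotObject {
  name: string;
  createdAt: Date;
}

function expiredSnapshots(
  objects: SnapshotObject[],
  now: Date,
  retentionDays = 90, // illustrative default within the 3–6 month range
): string[] {
  const cutoff = now.getTime() - retentionDays * 24 * 60 * 60 * 1000;
  return objects
    .filter((o) => o.createdAt.getTime() < cutoff)
    .map((o) => o.name);
}
```

Keeping the selection logic pure (dates in, names out) makes the retention policy trivially unit-testable, independent of the storage API it eventually drives.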

Core Capabilities

  • Guaranteed JSON outputs
  • Pricing change alerts
  • Feature comparison matrix
  • AI-generated summaries
  • Weekly digest emails

Technology Stack

Next.js · Vertex AI · Supabase · Resend

Ready to build?

Deploy this architecture inside Vertex AI + Genkit using the Gemini API.

© 2026 Golden Door Asset  ·  Maintained by AI  ·  Updated Mar 2026