Project Blueprint: KYC Document Verifier
Subtitle: Extract and match data from IDs and utility bills
1. The Business Problem (Why build this?)
In today's highly regulated financial landscape, "Know Your Customer" (KYC) processes are not merely a compliance checkbox but a critical defense against fraud, money laundering, and terrorist financing. For banks, fintech companies, cryptocurrency exchanges, and various regulated industries, robust KYC procedures are non-negotiable. However, traditional KYC methods are plagued by inefficiencies and significant operational challenges:
- Manual & Error-Prone Verification: Relying on human agents to manually review, extract, and cross-reference data from identity documents (IDs) and utility bills is slow, prone to human error, and inconsistent. This leads to higher operational costs and a sub-optimal customer experience.
- Slow Onboarding & Customer Churn: The verification bottleneck directly impacts customer onboarding speed. Protracted verification times can frustrate potential customers, leading to abandonment and lost revenue opportunities. In competitive markets, speed to service is a key differentiator.
- High Operational Costs: Each manual review incurs a cost, encompassing personnel salaries, training, and infrastructure. As transaction volumes grow, these costs scale linearly, becoming unsustainable for rapidly expanding businesses.
- Increased Fraud Risk: Manual processes are more susceptible to sophisticated fraud attempts, including doctored documents or identity impersonation. The inability to quickly and accurately cross-reference data points across multiple documents can allow fraudulent actors to slip through the net.
- Compliance Penalties & Reputational Damage: Non-compliance with evolving AML (Anti-Money Laundering) and KYC regulations can result in severe financial penalties, regulatory sanctions, and irreparable damage to a company's brand and reputation.
- Scalability Limitations: Manual verification systems do not scale efficiently with business growth. Hiring and training more staff is a slow and expensive process, creating a choke point for expansion into new markets or handling increased transaction volumes.
- Lack of Auditability & Transparency: Maintaining a comprehensive, unalterable audit trail of every verification step, decision, and extracted data point is challenging with manual processes. This is crucial for regulatory scrutiny and internal investigations.
The imperative is clear: automate and enhance the KYC process to achieve higher accuracy, speed, compliance, and scalability, all while significantly improving the customer experience and reducing operational overhead.
2. Solution Overview
The KYC Document Verifier is a sophisticated software application designed to intelligently automate the extraction, matching, and verification of key personal data from government-issued identification documents and utility bills. Leveraging cutting-edge AI, the system aims to streamline the onboarding process, enhance security, and ensure regulatory compliance.
Core Functionality:
- Secure Document Upload: A user-friendly interface for customers to securely upload images or PDFs of their identity documents (e.g., passport, driving license, national ID) and recent utility bills (e.g., electricity, water, gas).
- AI-Powered Data Extraction (Gemini 1.5 Pro): Utilizing Google's Gemini 1.5 Pro's multi-modal capabilities to perform advanced Optical Character Recognition (OCR) and semantic understanding. This extracts structured data fields (e.g., name, date of birth, address, document number, utility company, billing period) from the uploaded images with high accuracy.
- Cross-Document Data Matching Logic: An intelligent engine that compares and cross-references extracted data points across different documents. For instance, it verifies if the name and address on the ID match those on the utility bill.
- Verification Scoring & Triage: A configurable scoring mechanism that assigns a verification score based on the accuracy and consistency of matched data, the presence of all required documents, and any detected anomalies. This score determines the verification outcome: "Auto-Approved," "Requires Manual Review," or "Rejected."
- Secure Document Storage: Encrypted, compliant storage for raw uploaded documents and extracted metadata, ensuring data privacy and regulatory adherence.
- Intuitive Review Interface: For compliance officers, a dashboard to review verification outcomes, inspect extracted data, view original documents (with appropriate access controls), and manually override decisions if necessary.
- Audit Trail: A comprehensive log of every action, data extraction, matching decision, and user interaction, ensuring full transparency and traceability for compliance audits.
Target Users:
- Applicants/Customers: Individuals undergoing the KYC process, experiencing a seamless and swift document submission.
- Compliance Officers/Risk Analysts: The primary internal users who will leverage the system's insights, review flagged cases, and make final decisions.
- Onboarding Teams: Utilizing the system to accelerate customer onboarding flows and reduce manual workload.
Key Differentiators:
- Unparalleled Speed: Automates a process that typically takes hours or days, reducing it to minutes.
- Enhanced Accuracy: Leverages state-of-the-art AI to minimize human error in data extraction and matching.
- Scalability: Designed to handle increasing volumes of verification requests without proportional increases in operational staff.
- Auditability: Provides a clear, tamper-proof record of all verification activities.
- Configurability: Allows institutions to tailor matching rules and scoring thresholds to their specific risk appetite and regulatory requirements.
3. Architecture & Tech Stack Justification
The KYC Document Verifier will be built on a robust, scalable, and secure modern cloud-native architecture, leveraging Google Cloud Platform (GCP) services for optimal performance, reliability, and ease of management.
Overall Architecture Diagram (Conceptual Flow):
[User Browser/Mobile App]
|
| (1. Upload Doc via React Dropzone)
V
[Next.js Frontend (React)] ----> [Next.js API Routes (Backend)]
| |
| (2. Initial validation, store raw file, trigger AI processing)
V V
[Google Cloud Storage] <-------------------- [Google Cloud Run (Backend Service)]
(Encrypted Raw Docs) |
^ | (3. Send image + prompt to Gemini)
| (4. Store extracted data, audit logs) V
[Google Cloud SQL (PostgreSQL)] <------------ [Gemini 1.5 Pro API]
|
| (5. Match data, calculate score)
V
[Cloud Run (Backend Service)]
|
| (6. Update verification status, notify frontend)
V
[Google Cloud SQL (PostgreSQL)]
|
| (7. Frontend polls/receives status update)
V
[Next.js Frontend (React)]
Tech Stack Justification:
- Frontend: Next.js (React) + Tailwind CSS + React Dropzone
- Next.js: A full-stack React framework that enables server-side rendering (SSR), static site generation (SSG), and API routes. This provides excellent performance, improved SEO (though less critical for an internal tool), and a unified development experience for both frontend and backend logic (for simpler API needs). Its file-system-based routing simplifies page creation.
- React: The industry-standard library for building dynamic and responsive user interfaces. Its component-based architecture facilitates modularity and reusability.
- Tailwind CSS: A utility-first CSS framework. It significantly accelerates UI development by providing a comprehensive set of pre-defined utility classes, allowing for rapid styling directly in markup without writing custom CSS, ensuring consistency and a clean design.
- React Dropzone: A popular and robust library that simplifies the implementation of drag-and-drop file upload functionality, providing a great user experience and handling complexities like file previews and validation.
- Backend: Next.js API Routes (Node.js) / Google Cloud Run
- While Next.js API routes are suitable for simple API endpoints, for complex business logic, long-running AI inference orchestration, and background processing, a dedicated serverless service is preferred.
- Google Cloud Run: A fully managed compute platform that allows deploying containerized applications. It's ideal for this project because:
- Serverless: Auto-scales from zero to thousands of instances based on demand, meaning we only pay when requests are being processed. This is highly cost-efficient for variable workloads.
- Container-based: Offers flexibility to use any language (Node.js, Python, Go, etc.) and package dependencies, providing portability and consistency across environments.
- Rapid Deployment: Quick to deploy and manage, integrating seamlessly with CI/CD pipelines.
- Scalability: Handles concurrent requests efficiently, crucial for high-volume document processing.
- AI/ML: Gemini 1.5 Pro
- Gemini 1.5 Pro: Google's leading multi-modal AI model. Its core strength lies in its ability to process and understand vast amounts of information across different modalities, including images and text, with a massive context window.
- Document Understanding: Perfect for OCR and semantic extraction from varied document layouts (IDs, utility bills). It can identify and extract specific fields even from semi-structured or unstructured text within images.
- High Accuracy: State-of-the-art performance in natural language understanding and image comprehension.
- Structured Output: Capable of generating responses in a structured format (e.g., JSON), which is crucial for automated processing.
- Google Cloud Integration: Seamless integration with other GCP services, simplified API access, and robust infrastructure.
- Database: Google Cloud SQL for PostgreSQL
- PostgreSQL: A powerful, open-source relational database known for its robustness, reliability, rich feature set, and strong support for data integrity (ACID compliance). It's excellent for storing structured data like user accounts, verification attempts, extracted data, audit logs, and matching rules.
- Google Cloud SQL: A fully managed relational database service.
- Managed Service: Google handles patching, backups, replication, and scaling, reducing operational overhead.
- High Availability: Offers built-in failover mechanisms.
- Scalability: Easily scale CPU, memory, and storage resources as needed.
- Security: Strong security features, including encryption at rest and in transit, and network isolation.
- Storage: Google Cloud Storage (GCS)
- Cloud Storage: A highly scalable, durable, and cost-effective object storage service.
- Secure Document Storage: Ideal for storing raw uploaded documents (IDs, utility bills) in an encrypted, immutable, and versioned manner.
- Cost-Effective: Tiered storage classes (Standard, Nearline, Coldline, Archive) allow for cost optimization based on access frequency.
- Compliance: Supports various compliance requirements (e.g., data residency, retention policies).
- Access Control: Fine-grained access control using IAM policies and signed URLs for temporary, restricted access.
- Authentication & Authorization: Firebase Authentication / Google Cloud IAM
- Firebase Authentication: Provides ready-to-use authentication services (email/password, social logins). For internal tools, integration with Google accounts is straightforward.
- Google Cloud IAM: For controlling access to GCP resources (Cloud Run, Cloud SQL, GCS, Gemini API). Principle of least privilege will be enforced.
4. Core Feature Implementation Guide
This section details the implementation strategy for the critical features, including architecture advice, pipeline designs, and pseudo-code snippets.
A. Secure Document Upload (Frontend: Next.js/React Dropzone)
The client-side upload component handles file selection, basic validation, and initiates the upload process to the backend.
- User Interface: A <Dropzone /> component from react-dropzone provides a visual drag-and-drop area.
- Client-Side Validation:
  - File Type: Restrict to common image (JPG, PNG) and PDF formats.
  - File Size: Implement limits (e.g., 10MB per file) to prevent large uploads from overwhelming the system or incurring excessive processing costs.
  - Number of Files: Enforce the expected document count (e.g., one ID, one utility bill).
- Upload Mechanism:
  - Upon file selection, display loading indicators.
  - Send files to a Next.js API route (/api/upload) using FormData and fetch or axios.
  - Handle success and error states, providing clear user feedback.
Pseudo-code (Frontend - components/DocumentUploader.tsx):
// components/DocumentUploader.tsx
import React, { useCallback, useState } from 'react';
import { useDropzone } from 'react-dropzone';
import { toast } from 'react-toastify'; // For user feedback
interface DocumentUploaderProps {
onUploadSuccess: (docType: 'id' | 'bill', fileId: string) => void;
docType: 'id' | 'bill';
acceptedFileTypes: string[];
}
const DocumentUploader: React.FC<DocumentUploaderProps> = ({ onUploadSuccess, docType, acceptedFileTypes }) => {
const [isUploading, setIsUploading] = useState(false);
const onDrop = useCallback(async (acceptedFiles: File[]) => {
if (acceptedFiles.length === 0) {
toast.error('No files selected or file type not accepted.');
return;
}
if (acceptedFiles.length > 1) {
toast.error('Please upload only one file at a time.');
return;
}
const file = acceptedFiles[0];
if (file.size > 10 * 1024 * 1024) { // 10MB limit
toast.error('File size exceeds 10MB limit.');
return;
}
setIsUploading(true);
const formData = new FormData();
formData.append('document', file);
formData.append('docType', docType); // 'id' or 'bill'
try {
const response = await fetch('/api/upload', {
method: 'POST',
body: formData,
});
if (!response.ok) {
throw new Error(`Upload failed: ${response.statusText}`);
}
const data = await response.json();
toast.success(`${docType === 'id' ? 'ID' : 'Utility Bill'} uploaded successfully!`);
onUploadSuccess(docType, data.fileId); // fileId is a unique identifier from backend
} catch (error: any) {
console.error('Upload error:', error);
toast.error(`Error uploading ${docType === 'id' ? 'ID' : 'utility bill'}: ${error.message}`);
} finally {
setIsUploading(false);
}
}, [onUploadSuccess, docType]);
const { getRootProps, getInputProps, isDragActive } = useDropzone({
onDrop,
accept: { 'image/*': ['.jpeg', '.jpg', '.png'], 'application/pdf': ['.pdf'] }, // Example accepted types
maxFiles: 1,
});
return (
<div
{...getRootProps()}
className={`border-2 border-dashed rounded-lg p-6 text-center cursor-pointer
${isDragActive ? 'border-blue-500 bg-blue-50' : 'border-gray-300 bg-gray-50'}`}
>
<input {...getInputProps()} />
{isUploading ? (
<p>Uploading...</p>
) : (
isDragActive ? (
<p>Drop the {docType === 'id' ? 'ID document' : 'utility bill'} here ...</p>
) : (
<p>Drag 'n' drop {docType === 'id' ? 'ID document' : 'utility bill'} here, or click to select files</p>
)
)}
<em className="text-sm text-gray-500">(Only JPG, PNG, PDF files up to 10MB)</em>
</div>
);
};
export default DocumentUploader;
B. Backend Document Processing Pipeline (Backend: Google Cloud Run)
This is the core server-side logic responsible for handling uploads, orchestrating AI calls, and storing results. It should be designed for asynchronous processing to prevent timeouts.
Pipeline Flow:
- Reception & Server-Side Validation: The /api/upload endpoint receives the file. Perform server-side validation (file type, size), as client-side checks can be bypassed.
- Generate Unique ID: Assign a unique verification_id and document_id for each file and verification attempt.
- Secure Storage (GCS): Upload the raw document to a designated, encrypted Google Cloud Storage bucket. This ensures raw data is stored securely and is accessible for audits. Store the GCS path in the database.
- Initiate Asynchronous Processing: Instead of processing immediately, publish a message to a Google Cloud Pub/Sub topic containing the document_id and GCS URI. This decouples the upload request from the long-running AI process, ensuring a responsive frontend.
- Cloud Run Worker (Pub/Sub Subscriber): A separate Cloud Run service subscribes to the Pub/Sub topic.
  - Download Document: Retrieve the document from GCS using the URI.
  - AI Orchestration (Gemini 1.5 Pro):
    - Preprocessing: If needed, convert PDF pages to images or optimize image quality for Gemini.
    - Gemini API Call: Construct a detailed prompt (see Section 5) and send the image data to Gemini 1.5 Pro.
    - Error Handling: Implement retries and robust error handling for API calls.
  - Data Normalization & Storage:
    - Parse Gemini's structured JSON output.
    - Normalize data formats (e.g., dates to ISO 8601, addresses to a consistent format).
    - Store extracted data in Cloud SQL (e.g., an extracted_data table linked to document_id).
- Trigger Matching Logic: Publish another Pub/Sub message to trigger the data matching service once both ID and utility bill data are extracted for a verification_id.
Pseudo-code (Backend - Next.js API Route for Upload):
// pages/api/upload.ts
import type { NextApiRequest, NextApiResponse } from 'next';
import { IncomingForm } from 'formidable';
import { Storage } from '@google-cloud/storage';
import { PubSub } from '@google-cloud/pubsub';
import { v4 as uuidv4 } from 'uuid';
import fs from 'fs'; // For temporary file operations
// Configure formidable to parse multipart/form-data
export const config = {
api: {
bodyParser: false, // Disable Next.js's default body parser
},
};
const storage = new Storage();
const pubsub = new PubSub();
const bucketName = process.env.GCS_BUCKET_NAME || 'kyc-documents-bucket';
const topicName = process.env.PUBSUB_TOPIC_NAME || 'document-processing-topic';
export default async function handler(req: NextApiRequest, res: NextApiResponse) {
if (req.method !== 'POST') {
return res.status(405).json({ message: 'Method Not Allowed' });
}
const form = new IncomingForm();
form.parse(req, async (err, fields, files) => {
if (err) {
console.error('Error parsing form:', err);
return res.status(500).json({ message: 'Error processing upload.' });
}
const documentFile = files.document?.[0];
const docType = fields.docType?.[0];
if (!documentFile || !docType) {
return res.status(400).json({ message: 'Missing document file or document type.' });
}
// Basic server-side validation
const allowedTypes = ['image/jpeg', 'image/png', 'application/pdf'];
if (!allowedTypes.includes(documentFile.mimetype || '')) {
return res.status(400).json({ message: 'Invalid file type. Only JPG, PNG, PDF are allowed.' });
}
if (documentFile.size > 10 * 1024 * 1024) { // 10MB
return res.status(400).json({ message: 'File size exceeds 10MB limit.' });
}
try {
const documentId = uuidv4();
const verificationId = fields.verificationId?.[0] || uuidv4(); // Create or use existing verification ID
const fileName = `${verificationId}/${documentId}-${docType}-${documentFile.originalFilename}`;
// Upload file to GCS (bucket.upload takes a local path; File objects have no .upload method)
await storage.bucket(bucketName).upload(documentFile.filepath, {
destination: fileName,
metadata: { contentType: documentFile.mimetype || undefined },
});
console.log(`File ${fileName} uploaded to GCS.`);
// Publish message to Pub/Sub for asynchronous processing
const messageData = {
documentId: documentId,
verificationId: verificationId,
docType: docType,
gcsUri: `gs://${bucketName}/${fileName}`,
};
const dataBuffer = Buffer.from(JSON.stringify(messageData));
await pubsub.topic(topicName).publishMessage({ data: dataBuffer });
console.log(`Message published to Pub/Sub for document ${documentId}.`);
// Clean up temporary file created by formidable
fs.unlink(documentFile.filepath, (unlinkErr) => {
if (unlinkErr) console.error('Error deleting temp file:', unlinkErr);
});
return res.status(200).json({ message: 'Document uploaded and processing initiated.', documentId, verificationId });
} catch (error) {
console.error('GCS upload or Pub/Sub publish error:', error);
return res.status(500).json({ message: 'Failed to upload document or initiate processing.' });
}
});
}
Pseudo-code (Cloud Run Worker - Pub/Sub Subscriber for AI Processing):
# main.py (Cloud Run service, Python example for Gemini integration)
import functions_framework
from google.cloud import storage, pubsub_v1
import google.generativeai as genai
import base64  # Pub/Sub event payloads are base64-encoded
import json
import os
# Initialize clients
storage_client = storage.Client()
pubsub_publisher = pubsub_v1.PublisherClient()
genai.configure(api_key=os.environ.get("GEMINI_API_KEY"))
model = genai.GenerativeModel('gemini-1.5-pro')
# Database connection (using SQLAlchemy for ORM)
from sqlalchemy import create_engine, Column, String, JSON, DateTime
from sqlalchemy.orm import sessionmaker
from sqlalchemy.ext.declarative import declarative_base
from datetime import datetime
DATABASE_URL = os.environ.get("DATABASE_URL")
engine = create_engine(DATABASE_URL)
Session = sessionmaker(bind=engine)
Base = declarative_base()
class ExtractedData(Base):
__tablename__ = 'extracted_data'
document_id = Column(String, primary_key=True)
verification_id = Column(String, nullable=False)
doc_type = Column(String, nullable=False)
gcs_uri = Column(String, nullable=False)
extracted_fields = Column(JSON)
processing_status = Column(String, default='PENDING')
processed_at = Column(DateTime)
created_at = Column(DateTime, default=datetime.utcnow)
Base.metadata.create_all(engine) # Create table if it doesn't exist
@functions_framework.cloud_event
def process_document(cloud_event):
message = json.loads(base64.b64decode(cloud_event.data["message"]["data"]).decode("utf-8"))
document_id = message["documentId"]
verification_id = message["verificationId"]
doc_type = message["docType"]
gcs_uri = message["gcsUri"]
print(f"Processing document {document_id} from {gcs_uri}")
session = Session()
try:
# Download document from GCS
bucket_name, blob_name = gcs_uri.replace("gs://", "").split("/", 1)
bucket = storage_client.bucket(bucket_name)
blob = bucket.blob(blob_name)
# Read file into memory (for smaller files; for large PDFs, stream processing or temp file might be better)
doc_content = blob.download_as_bytes()
# Determine prompt based on document type
if doc_type == 'id':
prompt = get_id_prompt() # Defined in Section 5
elif doc_type == 'bill':
prompt = get_bill_prompt() # Defined in Section 5
else:
raise ValueError(f"Unknown document type: {doc_type}")
# Prepare image for Gemini (assuming it's an image or PDF convertible to image)
# For PDFs, you'd typically convert pages to images first using a library like pdf2image
image_part = {
"mime_type": blob.content_type, # e.g., "image/jpeg"; call blob.reload() first if metadata is unset
"data": doc_content
}
# Call Gemini API; the model may wrap JSON in markdown fences, so strip them defensively
response = model.generate_content([prompt, image_part])
raw_text = response.text.strip()
if raw_text.startswith("```"):
    raw_text = raw_text.strip("`").removeprefix("json").strip()
extracted_data_json = json.loads(raw_text)
# Store extracted data
extracted_entry = ExtractedData(
document_id=document_id,
verification_id=verification_id,
doc_type=doc_type,
gcs_uri=gcs_uri,
extracted_fields=extracted_data_json,
processing_status='COMPLETED',
processed_at=datetime.utcnow()
)
session.add(extracted_entry)
session.commit()
print(f"Successfully extracted data for document {document_id}.")
# After processing, trigger the matching logic if both ID and bill are processed
# This would involve querying the database for the other document.
# If both exist and are processed, publish a message to a 'matching-topic'.
# For simplicity, this step is omitted from pseudo-code but is a critical next step.
except Exception as e:
session.rollback()
print(f"Error processing document {document_id}: {e}")
# Update status to FAILED in DB and potentially send to a dead-letter queue
extracted_entry = session.query(ExtractedData).filter_by(document_id=document_id).first()
if extracted_entry:
extracted_entry.processing_status = 'FAILED'
session.commit()
finally:
session.close()
# Helper functions for prompts (content defined in Section 5)
def get_id_prompt():
# ... return ID prompt string ...
pass
def get_bill_prompt():
# ... return Bill prompt string ...
pass
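The trigger step omitted above can be sketched as a pure decision helper plus a message builder; the worker would query extracted_data for the verification_id, call these, and publish via pubsub_publisher. The topic name "matching-topic" and the payload shape are assumptions carried over from this blueprint:

```python
import json

def should_trigger_matching(statuses: dict) -> bool:
    """True once both the ID and the utility bill have been extracted.

    `statuses` maps doc_type ('id' / 'bill') to processing_status, as read
    from the extracted_data table for one verification_id.
    """
    return statuses.get("id") == "COMPLETED" and statuses.get("bill") == "COMPLETED"

def build_matching_message(verification_id: str) -> bytes:
    """Payload for the assumed 'matching-topic' Pub/Sub message."""
    return json.dumps({"verificationId": verification_id}).encode("utf-8")
```

Keeping the decision logic free of I/O makes it trivially unit-testable, while the worker retains responsibility for the database query and the actual publish call.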
C. Data Matching Logic
This component compares the extracted data from the ID and utility bill to determine consistency.
- Trigger: Activated when both the ID and utility bill for a given verification_id have been successfully processed and their data stored. This would typically be another Pub/Sub message.
- Key Matching Fields:
  - Name: full_name from the ID vs. customer_name from the utility bill.
  - Date of Birth: date_of_birth from the ID. (No direct match on the bill, but its presence is crucial.)
  - Address: address from the ID vs. service_address from the utility bill.
- Matching Algorithm:
  - Exact Match: For sensitive fields like DOB (if provided) or document numbers (if relevant cross-referencing points exist).
  - Fuzzy Matching (Names): Use algorithms like Levenshtein distance, Jaro-Winkler distance, or token sort ratio (the fuzzywuzzy library in Python) to handle minor discrepancies (typos, abbreviations).
  - Address Standardization: Addresses are notoriously tricky.
    - Normalize addresses: convert street suffixes (St/Street, Rd/Road), remove apartment numbers (or match them separately), and convert to a single case.
    - Compare normalized addresses using fuzzy matching. Consider geographic proximity if a full address database is available (out of scope for the initial MVP).
  - Date Formats: Ensure all dates are normalized to a consistent format (e.g., YYYY-MM-DD) before comparison.
- Rule Engine: Define configurable rules for comparison.
  - name_similarity_threshold: e.g., 85 (Jaro-Winkler).
  - address_similarity_threshold: e.g., 75 (Levenshtein).
  - dob_must_match_exactly: Boolean.
  - document_expiry_check: the ID must not be expired.
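The token-sort idea used for name matching can be illustrated with the standard library alone. Here difflib stands in for fuzzywuzzy's fuzz.token_sort_ratio; the exact scores differ slightly between the two, but the technique (sort tokens, then compare) is the same:

```python
from difflib import SequenceMatcher

def token_sort_similarity(a: str, b: str) -> int:
    """Similarity in 0-100 after sorting whitespace-separated tokens.

    Sorting the tokens first makes the score insensitive to word order,
    so "John Smith" and "Smith John" compare as identical.
    """
    norm_a = " ".join(sorted(a.lower().split()))
    norm_b = " ".join(sorted(b.lower().split()))
    return round(SequenceMatcher(None, norm_a, norm_b).ratio() * 100)
```

In production you would likely keep fuzzywuzzy (or its maintained fork, thefuzz) as the pseudo-code below does; this sketch only shows why token sorting helps with reordered names.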
Pseudo-code (Cloud Run Worker - Matching Service):
# matching_service.py (Cloud Run service subscribed to 'matching-topic')
import functions_framework
from google.cloud import pubsub_v1
import base64  # Pub/Sub event payloads are base64-encoded
import json
import os
import re  # used by normalize_address below
from datetime import datetime
from fuzzywuzzy import fuzz # Popular library for fuzzy string matching
from sqlalchemy.orm import sessionmaker
# Database connection details (same as above)
# ... imports for SQLAlchemy (Column, String, Integer, JSON, DateTime), engine, Session, Base, and the ExtractedData model ...
# Add new model for VerificationResult
class VerificationResult(Base):
__tablename__ = 'verification_results'
verification_id = Column(String, primary_key=True)
id_document_id = Column(String)
bill_document_id = Column(String)
overall_score = Column(Integer)
status = Column(String) # e.g., 'AUTO_APPROVED', 'MANUAL_REVIEW', 'REJECTED'
match_details = Column(JSON) # Store details of each field match
created_at = Column(DateTime, default=datetime.utcnow)
updated_at = Column(DateTime, default=datetime.utcnow, onupdate=datetime.utcnow)
Base.metadata.create_all(engine)
def normalize_string(s: str) -> str:
return s.strip().lower().replace('.', '').replace(',', '')
def normalize_address(address: str) -> str:
# A more sophisticated normalization would involve geocoding or USPS API
normalized = address.lower()
normalized = normalized.replace('street', 'st').replace('road', 'rd').replace('avenue', 'ave')
normalized = re.sub(r'\W+', ' ', normalized) # Collapse runs of non-alphanumeric chars into single spaces
return normalized.strip()
@functions_framework.cloud_event
def run_matching(cloud_event):
message = json.loads(base64.b64decode(cloud_event.data["message"]["data"]).decode("utf-8"))
verification_id = message["verificationId"]
session = Session()
try:
id_data_entry = session.query(ExtractedData).filter_by(verification_id=verification_id, doc_type='id').first()
bill_data_entry = session.query(ExtractedData).filter_by(verification_id=verification_id, doc_type='bill').first()
if not id_data_entry or not bill_data_entry:
print(f"Missing one or both documents for verification ID {verification_id}. Waiting for full data.")
return # Can add logic to retry or alert if docs are missing for too long
id_fields = id_data_entry.extracted_fields
bill_fields = bill_data_entry.extracted_fields
match_details = {}
overall_score = 0
max_score = 100 # Example
# 1. Name Matching
id_name = normalize_string(id_fields.get('full_name', ''))
bill_name = normalize_string(bill_fields.get('customer_name', ''))
name_similarity = fuzz.token_sort_ratio(id_name, bill_name)
match_details['name_match_score'] = name_similarity
if name_similarity >= 85: # Configurable threshold
overall_score += 30 # Points for good name match
match_details['name_match_status'] = 'STRONG_MATCH'
elif name_similarity >= 70:
overall_score += 15
match_details['name_match_status'] = 'MODERATE_MATCH'
else:
match_details['name_match_status'] = 'NO_MATCH'
overall_score -= 20 # Penalty
# 2. Address Matching
id_address = normalize_address(id_fields.get('address', ''))
bill_address = normalize_address(bill_fields.get('service_address', ''))
address_similarity = fuzz.token_sort_ratio(id_address, bill_address)
match_details['address_match_score'] = address_similarity
if address_similarity >= 80: # Configurable threshold
overall_score += 30
match_details['address_match_status'] = 'STRONG_MATCH'
elif address_similarity >= 60:
overall_score += 15
match_details['address_match_status'] = 'MODERATE_MATCH'
else:
match_details['address_match_status'] = 'NO_MATCH'
overall_score -= 20
# 3. Date of Birth Check (ID only)
# Assuming ID has DOB. If utility bill had it, compare.
id_dob_str = id_fields.get('date_of_birth', '')
dob_valid = False
if id_dob_str:
try:
# Basic date parsing; more robust validation needed
dob = datetime.strptime(id_dob_str, '%Y-%m-%d')
dob_valid = True
today = datetime.now()
age = today.year - dob.year - ((today.month, today.day) < (dob.month, dob.day))
if age < 18:  # month/day-aware check; a bare year subtraction can overstate age
match_details['age_check_status'] = 'UNDERAGE'
dob_valid = False
overall_score -= 50 # Significant penalty
else:
match_details['age_check_status'] = 'PASSED'
overall_score += 20
except ValueError:
match_details['age_check_status'] = 'INVALID_FORMAT'
overall_score -= 10
else:
match_details['age_check_status'] = 'MISSING'
overall_score -= 10
# 4. ID Expiry Check
id_expiry_str = id_fields.get('expiry_date', '')
id_not_expired = False
if id_expiry_str:
try:
expiry_date = datetime.strptime(id_expiry_str, '%Y-%m-%d')
if expiry_date > datetime.now():
id_not_expired = True
match_details['id_expiry_status'] = 'VALID'
overall_score += 20
else:
match_details['id_expiry_status'] = 'EXPIRED'
overall_score -= 30
except ValueError:
match_details['id_expiry_status'] = 'INVALID_FORMAT'
overall_score -= 5
else:
match_details['id_expiry_status'] = 'MISSING'
overall_score -= 10
# Determine overall status based on score
status = 'MANUAL_REVIEW'
if overall_score >= 80: # High confidence
status = 'AUTO_APPROVED'
elif overall_score < 40: # Low confidence, likely rejection
status = 'REJECTED'
# Store results
verification_result = VerificationResult(
verification_id=verification_id,
id_document_id=id_data_entry.document_id,
bill_document_id=bill_data_entry.document_id,
overall_score=overall_score,
status=status,
match_details=match_details
)
session.add(verification_result)
session.commit()
print(f"Verification {verification_id} completed with status: {status}, score: {overall_score}")
# Update frontend via WebSocket or polling (not shown)
except Exception as e:
session.rollback()
print(f"Error in matching for verification ID {verification_id}: {e}")
# Update verification_result status to FAILED
finally:
session.close()
D. Verification Scoring
The scoring mechanism provides a quantifiable measure of confidence in the verification and determines the automatic action.
- Weighted Scoring: Each matching criterion (Name, Address, DOB validity, ID expiry) is assigned a weight or point value.
- Strong Match: +X points.
- Moderate Match: +Y points.
- No Match/Missing Field: -Z points (penalties).
- Critical Failure (e.g., Expired ID, Underage): -Large_Z points, potentially an immediate 'Rejected' status regardless of other scores.
- Thresholds:
- Auto-Approved: Score >= 80 (e.g., all critical fields match strongly).
- Manual Review: Score between 40 and 79 (e.g., some discrepancies, ambiguous fields, or critical fields missing).
- Rejected: Score < 40 (e.g., significant mismatches, expired ID, suspected fraud).
- Audit Trail: Every calculation, rule application, score, and final status must be logged in the `verification_results` table for auditability.
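The threshold rules above can be captured in one small, easily unit-tested function. A minimal sketch (the function name is illustrative; the cutoffs mirror the worker code earlier in this section):

```python
def score_to_status(overall_score: int) -> str:
    """Map a verification score to an action, mirroring the thresholds above."""
    if overall_score >= 80:
        return "AUTO_APPROVED"   # high confidence: all critical fields match
    if overall_score < 40:
        return "REJECTED"        # low confidence: mismatches or expired ID
    return "MANUAL_REVIEW"       # ambiguous: route to a human reviewer

print(score_to_status(95))  # AUTO_APPROVED
```

Keeping this mapping in one place (rather than scattering comparisons through the worker) makes the thresholds trivial to tune and to cover with boundary tests.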
E. Secure Document Storage (Google Cloud Storage)
Data security and privacy are paramount.
- Encryption at Rest: GCS automatically encrypts data at rest using Google-managed encryption keys by default. Customer-managed encryption keys (CMEK) can be used for an additional layer of control.
- Encryption in Transit: All communication with GCS happens over TLS (Transport Layer Security) encrypted channels.
- Access Control (IAM):
- Implement granular IAM policies for the GCS bucket.
- Only the Cloud Run processing service should have read/write access to the raw documents bucket.
- Compliance officers or authorized personnel will access documents via a controlled API endpoint that generates time-limited, signed URLs, preventing direct public access.
- Principle of Least Privilege: Ensure each service account has only the minimum necessary permissions.
- Data Retention & Lifecycle Management:
- Configure GCS bucket lifecycle rules to automatically delete documents after a defined retention period (e.g., 7 years, based on compliance requirements) or transition them to cheaper cold storage tiers.
- Implement versioning on the bucket to recover from accidental deletions or modifications.
- Compliance: Design with GDPR, CCPA, and other relevant data privacy regulations in mind, particularly concerning data residency and data subject rights. Documents should be stored in the appropriate regional bucket.
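As one illustration of the retention policy above, a GCS lifecycle configuration (the file you would pass to `gsutil lifecycle set`) might look like the sketch below; the ages are in days (2555 ≈ 7 years) and are assumptions to adjust against your actual compliance requirements:

```json
{
  "rule": [
    {
      "action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
      "condition": {"age": 365}
    },
    {
      "action": {"type": "Delete"},
      "condition": {"age": 2555}
    }
  ]
}
```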
5. Gemini Prompting Strategy
The quality of extracted data directly hinges on effective prompt engineering for Gemini 1.5 Pro. The goal is to instruct Gemini to act as a highly specialized OCR and data extraction agent, providing structured output for downstream processing.
Key Principles for Gemini Prompts:
- Role Setting: Start by explicitly defining Gemini's role to align its behavior.
- Clear Task Definition: State precisely what data needs to be extracted.
- Strict Output Format: Demand JSON output with specific keys and expected data types. This is crucial for programmatic parsing.
- Field Enumeration & Examples: List all required fields explicitly. Provide examples of the expected values for each field.
- Error Handling Instructions: Tell Gemini what to do if a field is missing, illegible, or ambiguous (e.g., return `null` or `"N/A"`).
- Multi-modality (Implicit): Gemini 1.5 Pro inherently handles the image. The prompt guides its interpretation of the image's textual and spatial content.
- Safety & Responsible AI: While not explicitly in the prompt, be mindful of responsible AI practices when processing sensitive data. Gemini has built-in safety filters; ensure your use case is aligned.
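Even with a "JSON object ONLY" instruction, models occasionally wrap output in markdown fences, so a defensive parser on the worker side is cheap insurance. A minimal stdlib-only sketch (the fence-stripping heuristic is an assumption about failure modes, not a Gemini guarantee):

```python
import json
import re
from typing import Optional

def parse_model_json(raw: str) -> Optional[dict]:
    """Parse a model response that should be a single JSON object.

    Strips a wrapping markdown code fence if present and returns None
    (rather than raising) on malformed output, so the caller can route
    the document to manual review instead of crashing the worker.
    """
    text = raw.strip()
    # Remove a wrapping ```json ... ``` fence if the model added one.
    match = re.match(r"^```(?:json)?\s*(.*?)\s*```$", text, re.DOTALL)
    if match:
        text = match.group(1)
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return None
    # The prompts demand a JSON object, so reject arrays/scalars too.
    return data if isinstance(data, dict) else None

print(parse_model_json('```json\n{"full_name": "JOHN DOE"}\n```'))
```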
Example Prompt (ID Card Extraction):
You are an expert document parser specializing in extracting information from government-issued identity cards. Your task is to accurately read the provided ID card image and extract the specified fields.
**Instructions:**
1. Carefully examine the entire document for the requested information.
2. If a field is present and legible, extract its value accurately.
3. If a field is missing, illegible, or cannot be confidently identified, return its value as `null`.
4. **Output should be a single JSON object ONLY.** Do NOT include any additional text, explanations, or markdown outside the JSON.
5. All dates should be formatted as `YYYY-MM-DD`.
**Required Fields:**
* `document_type`: (e.g., "National ID Card", "Driver's License", "Passport")
* `full_name`: The full legal name of the cardholder.
* `date_of_birth`: Date of birth of the cardholder.
* `document_number`: The unique identification number of the document.
* `nationality`: The nationality of the cardholder, if available.
* `gender`: Gender of the cardholder (e.g., "M", "F", "Other").
* `place_of_birth`: Place of birth of the cardholder.
* `issue_date`: Date the document was issued.
* `expiry_date`: Date the document expires.
* `address`: The full residential address of the cardholder as listed on the ID. Concatenate all address lines into a single string.
* `issuing_authority`: The entity that issued the document.
* `photo_present`: Boolean (true if a clear photo of the cardholder is visible, false otherwise).
**Example Desired JSON Output:**
```json
{
"document_type": "Driver's License",
"full_name": "JOHN DOE",
"date_of_birth": "1990-05-15",
"document_number": "D123456789",
"nationality": "USA",
"gender": "M",
"place_of_birth": "NEW YORK, NY",
"issue_date": "2020-01-20",
"expiry_date": "2025-01-19",
"address": "123 MAIN ST, APT 4B, ANYTOWN, CA 90210",
"issuing_authority": "CALIFORNIA DMV",
"photo_present": true
}
```

**Example Prompt (Utility Bill Extraction):**
You are an expert document parser specializing in extracting information from utility bills. Your task is to accurately read the provided utility bill image and extract the specified fields.
**Instructions:**
1. Carefully examine the entire bill for the requested information.
2. If a field is present and legible, extract its value accurately.
3. If a field is missing, illegible, or cannot be confidently identified, return its value as `null`.
4. **Output should be a single JSON object ONLY.** Do NOT include any additional text, explanations, or markdown outside the JSON.
5. All dates should be formatted as `YYYY-MM-DD`.
**Required Fields:**
* `utility_company_name`: The name of the company issuing the bill.
* `customer_name`: The full name of the primary account holder.
* `service_address`: The address where the utility service is provided. Concatenate all address lines into a single string.
* `billing_period_start`: The start date of the billing cycle.
* `billing_period_end`: The end date of the billing cycle.
* `issue_date`: The date the bill was generated/issued.
* `due_date`: The date by which the payment is due.
* `account_number`: The unique account identifier for the customer.
* `total_amount_due`: The total amount due for the billing period, including currency symbol.
* `service_type`: The type of utility service (e.g., "Electricity", "Water", "Gas", "Internet").
**Example Desired JSON Output:**
```json
{
"utility_company_name": "NATIONAL POWER & LIGHT",
"customer_name": "JANE DOE",
"service_address": "456 OAK AVE, SUITE 100, TOWNSVILLE, NY 10001",
"billing_period_start": "2023-10-01",
"billing_period_end": "2023-10-31",
"issue_date": "2023-11-05",
"due_date": "2023-11-20",
"account_number": "9876543210",
"total_amount_due": "$125.75",
"service_type": "Electricity"
}
```

Iterative Refinement: Prompt engineering is an iterative process. It's crucial to:
- Test with diverse data: Use a wide variety of ID cards (different countries, states, designs) and utility bills (different providers, layouts).
- Analyze failures: If Gemini misses a field or provides incorrect data, adjust the prompt to explicitly guide it for that specific scenario.
- Monitor performance: Track accuracy metrics (precision, recall for each field) to continually improve the prompts.
- Few-shot examples: For very complex or highly variable documents, adding a few in-context examples (image + expected JSON) directly within the prompt can significantly boost performance.
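The per-field accuracy tracking mentioned above can start very simply: compare extracted values against a hand-labeled ground-truth set. A minimal sketch using exact-match comparison (real pipelines usually add normalization such as case-folding and date parsing first; the sample data is invented):

```python
from collections import defaultdict

def field_accuracy(samples):
    """Per-field exact-match accuracy over (extracted, ground_truth) dict pairs."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for extracted, truth in samples:
        for field, expected in truth.items():
            total[field] += 1
            if extracted.get(field) == expected:
                correct[field] += 1
    return {field: correct[field] / total[field] for field in total}

samples = [
    ({"full_name": "JOHN DOE", "expiry_date": "2025-01-19"},
     {"full_name": "JOHN DOE", "expiry_date": "2025-01-19"}),
    ({"full_name": "JANE DOE", "expiry_date": None},          # model missed the field
     {"full_name": "JANE DOE", "expiry_date": "2026-03-01"}),
]
print(field_accuracy(samples))  # {'full_name': 1.0, 'expiry_date': 0.5}
```

Tracking these numbers per document type and per provider makes it obvious which prompt (ID vs. utility bill) needs refinement next.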
6. Deployment & Scaling
Leveraging Google Cloud Platform (GCP) provides a robust, scalable, and secure environment for deploying and managing the KYC Document Verifier.
A. Deployment Strategy:
- Frontend (Next.js):
  - Hosting: Firebase Hosting or Google Cloud Storage + Cloud CDN. Firebase Hosting offers a global CDN, custom domains, SSL, and seamless CI/CD integration.
  - Build Process: The Next.js build (`next build`) generates static assets and server-side bundles.
  - CI/CD: Use Cloud Build to automatically build and deploy the Next.js application to Firebase Hosting upon code pushes to the main branch.
- Backend (Cloud Run Services):
  - Containerization: Each distinct backend service (e.g., `document-processing-worker`, `matching-service`) will be containerized using Docker. This ensures environment consistency.
  - Deployment: Deploy Docker images to Google Cloud Run. Cloud Run handles all infrastructure provisioning, scaling, and management.
  - CI/CD: Cloud Build will trigger on code pushes, build the Docker image, push it to Google Container Registry (GCR) or Artifact Registry, and then deploy the new image to the respective Cloud Run service.
- Database (Cloud SQL for PostgreSQL):
  - Instance Setup: Provision a Cloud SQL PostgreSQL instance with an appropriate machine type, storage, and high-availability settings (if critical for production).
  - Schema Migration: Use database migration tools (e.g., Flyway, Alembic, or Prisma Migrate) to manage schema changes in a version-controlled manner.
  - Connection: Cloud Run services will connect to Cloud SQL securely using the Cloud SQL Proxy, which handles secure, authorized connections without exposing the database publicly.
- Storage (Google Cloud Storage):
  - Bucket Creation: Create dedicated GCS buckets for raw documents and potentially for processed or temporary files.
  - IAM: Configure IAM roles and permissions to ensure only authorized services (e.g., Cloud Run workers) can access the buckets, adhering to the principle of least privilege.
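On the Cloud SQL connection above: when the Cloud SQL connection is enabled on a Cloud Run service, the platform mounts a Unix socket at `/cloudsql/<PROJECT:REGION:INSTANCE>`. A sketch of building the corresponding SQLAlchemy URL (the instance name and credentials are placeholders, and real deployments should pull the password from Secret Manager, not code):

```python
from urllib.parse import quote_plus

def cloud_sql_url(user: str, password: str, db: str, instance: str) -> str:
    """Build a SQLAlchemy URL for Cloud SQL over the mounted Unix socket.

    `instance` is the connection name, e.g. "my-project:us-central1:kyc-db"
    (a placeholder, not a real instance).
    """
    return (
        f"postgresql+psycopg2://{user}:{quote_plus(password)}@/{db}"
        f"?host=/cloudsql/{instance}"
    )

# The empty host before "/{db}" tells psycopg2 to use the socket in `host=`.
print(cloud_sql_url("kyc_app", "s3cret!", "kyc", "my-project:us-central1:kyc-db"))
```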
B. Scalability Considerations:
- Cloud Run's Auto-Scaling: Cloud Run services automatically scale horizontally from zero to hundreds or thousands of container instances based on incoming request load. This handles fluctuating demand efficiently.
- Asynchronous Processing with Pub/Sub:
- Decoupling: Pub/Sub acts as a highly scalable message queue, decoupling the document upload endpoint from the CPU-intensive AI processing and matching steps.
- Load Leveling: Absorbs spikes in upload traffic, ensuring the AI processing workers receive messages at a manageable rate, preventing backpressure.
- Resilience: Messages are persisted until acknowledged, guaranteeing delivery even if worker instances fail.
- Cloud SQL Scaling:
- Vertical Scaling: Easily upgrade CPU and memory for the Cloud SQL instance as database load increases.
- Read Replicas: For read-heavy workloads (e.g., dashboard querying verification results), provision read replicas to distribute read traffic and reduce load on the primary instance.
- Gemini 1.5 Pro: The Gemini API is designed for high throughput and automatically scales to meet demand. Monitor usage and quotas.
- Frontend Caching: Cloud CDN for static assets (JavaScript, CSS, images) improves loading times and reduces origin server load.
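To keep the Pub/Sub contract between the upload endpoint and the processing workers explicit, it helps to define the message payload in one place. A sketch of a payload builder (the field names are assumptions for illustration; the actual publish call via the `google-cloud-pubsub` client is shown only as a comment because it needs GCP credentials):

```python
import json
from datetime import datetime, timezone

def build_processing_message(verification_id: str, document_type: str, gcs_uri: str) -> bytes:
    """Serialize the payload handed from the upload endpoint to worker services."""
    payload = {
        "verification_id": verification_id,
        "document_type": document_type,   # e.g. "ID" or "UTILITY_BILL"
        "gcs_uri": gcs_uri,               # location of the raw document in GCS
        "enqueued_at": datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps(payload).encode("utf-8")

# Publishing (requires google-cloud-pubsub and credentials), roughly:
#   publisher = pubsub_v1.PublisherClient()
#   topic = publisher.topic_path("my-project", "document-processing")
#   publisher.publish(topic, build_processing_message("v-123", "ID", "gs://bucket/id.png"))

msg = json.loads(build_processing_message("v-123", "ID", "gs://bucket/id.png"))
print(msg["gcs_uri"])
```

Versioning this payload schema alongside the worker code prevents silent breakage when the upload endpoint and workers are deployed independently.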
C. Monitoring, Logging & Alerting:
- Google Cloud Logging: All Cloud Run, Cloud SQL, and other GCP service logs are automatically collected in Cloud Logging. Centralize logs for easier debugging and auditing.
- Google Cloud Monitoring:
- Dashboards: Create custom dashboards to visualize key metrics (e.g., Cloud Run request counts, latency, error rates; Cloud SQL CPU utilization, active connections; Pub/Sub message backlog).
- Alerting: Set up alerts for critical conditions: high error rates, long Pub/Sub message backlogs, Cloud SQL CPU saturation, or failed AI calls.
- Application-Level Metrics: Instrument Cloud Run services with application-specific metrics (e.g., document processing time, matching accuracy, verification scores) and export them to Cloud Monitoring.
D. Security:
- Google Cloud IAM: Granular, role-based access control for all GCP resources. Enforce least privilege.
- Secret Manager: Store sensitive information (Gemini API keys, database credentials, any other API keys) in Google Cloud Secret Manager. Cloud Run services can securely access these secrets at runtime without hardcoding them.
- VPC Service Controls (VPC-SC): For highly sensitive data, implement VPC Service Controls to create security perimeters around GCP resources, restricting data movement and reducing data exfiltration risks.
- Network Security:
- Cloud SQL instances should not be publicly accessible; connect via Cloud SQL Proxy or private IP.
- Cloud Run services can be configured to only allow internal traffic or traffic from specific IP ranges if needed.
- Google Cloud Armor (WAF) can be placed in front of frontend services for DDoS protection and web application firewall capabilities.
- Data Encryption: Ensure all data is encrypted at rest (GCS, Cloud SQL) and in transit (TLS).
- Regular Security Audits: Conduct periodic security audits and penetration testing.
