Research / Software Development · 3 years

INSIGHT — INtelligent Synthesis and Generation of High-quality Test Data

Realistic test data without real data

Python · FastAPI · PostgreSQL · RabbitMQ · MinIO · PyTorch · React · Astro · Docker · Keycloak

The Challenge

Anyone training machine learning models, testing software, or conducting data-driven research faces a fundamental problem: a lack of realistic, structured test data. Real-world datasets are often incomplete, imbalanced, or unusable due to data privacy regulations. Manual creation ties up capacity for weeks and still rarely delivers the necessary variance and coverage — especially in sensitive domains like healthcare, law, or education.

The core problems

  • Manual data creation ties up R&D capacity for weeks, often without reproducible results
  • Real data is frequently subject to the GDPR and cannot be freely shared or replicated
  • Existing tools typically cover only one approach — either rule-based or ML-powered, rarely both
  • No standardized process for reproducible, cross-team data generation
  • Relational dependencies like foreign keys and cardinalities are often lost in synthetic data

The Solution

INSIGHT is a research project funded by IFB Hamburg, developed by AKARA Solutions and HITeC e.V. (University of Hamburg). The project delivers a web-based platform for generating synthetic test data. Rather than a single approach, the platform unites three complementary use cases under one interface.

Our solution includes

  • Mockup — Rule-based generation with LLM assistance: For early project phases where no real data exists yet. Users define table structures, data types, and constraints such as primary keys, foreign keys, and unique constraints. From over 90 configurable data types, the platform generates relational datasets of any size. An integrated LLM chat assistant supports the entire workflow from schema definition to the finished file.
  • Replication — Statistical and AI-powered augmentation: For cases where real data exists but is too scarce or cannot be shared. Two methods are available. GaussianCopula learns the statistical distributions and correlations within a single table. For complex relational datasets with multiple linked tables, the platform uses ClavaDDPM, a diffusion model based on recent research (NeurIPS 2024) that generates tables in topological order and preserves referential integrity across foreign keys.
  • Reduction — Intelligent downsizing: The reverse case, in which large datasets are reduced to representative subsets. Three strategies are available depending on requirements: stratified sampling to preserve distributions, near-duplicate detection via MinHash/LSH for deduplication, and FAISS-based clustering for vector similarity analysis.

The Results

INSIGHT dramatically shortens the path from research idea to validated dataset — with full control over structure, quality, and scale of the synthetic data.

Mockup

  • Datasets are created in minutes instead of weeks, with no real data required
  • Over 90 configurable data types cover domain-specific requirements
  • Schemas can be version-controlled and reused across the team once created
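
To make the rule-based idea concrete, the following is a minimal sketch of generating two related tables with a primary key, a unique column, and a foreign key. It is not INSIGHT's implementation: the function name generate_mockup, the table and column names, and the value ranges are all hypothetical and chosen only for illustration. It uses nothing beyond the Python standard library.

```python
import random


def generate_mockup(n_customers: int, n_orders: int, seed: int = 42):
    """Generate a parent table (customers) and a child table (orders).

    Hypothetical example schema, not taken from the INSIGHT platform.
    """
    rng = random.Random(seed)  # fixed seed makes every run reproducible

    # Parent table: customer_id acts as primary key, email as unique column.
    customers = [
        {"customer_id": i, "email": f"user{i}@example.com"}
        for i in range(1, n_customers + 1)
    ]
    parent_keys = [c["customer_id"] for c in customers]

    # Child table: every foreign key is drawn from the existing parent keys,
    # so no order can reference a customer that does not exist.
    orders = [
        {
            "order_id": j,
            "customer_id": rng.choice(parent_keys),
            "amount": round(rng.uniform(5.0, 500.0), 2),
        }
        for j in range(1, n_orders + 1)
    ]
    return customers, orders


customers, orders = generate_mockup(n_customers=10, n_orders=50)
```

Seeding the random generator is one simple way to obtain the reproducibility mentioned above: the same schema and seed always yield the same dataset.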

Replication

  • Statistically realistic data at the push of a button — the trained model is reusable indefinitely
  • Automated quality evaluation with distribution comparison, correlation analysis, and coverage metrics, visualized in integrated dashboards
  • Multi-table generation with preserved referential integrity across foreign keys
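
The ordering idea behind multi-table generation can be sketched as follows. This is not ClavaDDPM and contains no diffusion model: random key sampling stands in for the learned generator, purely to show why parent tables must be produced before their children. The schema and the helper names generation_order and generate_tables are hypothetical.

```python
import random
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def generation_order(parents: dict) -> list:
    """Return table names so that every parent precedes its children.

    `parents` maps each table to the set of tables it references.
    """
    return list(TopologicalSorter(parents).static_order())


def generate_tables(parents: dict, rows_per_table: int, seed: int = 0) -> dict:
    """Fill tables in topological order (stand-in for a learned model)."""
    rng = random.Random(seed)
    tables = {}
    for name in generation_order(parents):
        rows = []
        for pk in range(1, rows_per_table + 1):
            row = {"id": pk}
            # Parents already exist at this point, so each foreign key
            # can be sampled from keys that were actually generated.
            for parent in sorted(parents.get(name, ())):
                row[f"{parent}_id"] = rng.choice(tables[parent])["id"]
            rows.append(row)
        tables[name] = rows
    return tables


# Hypothetical schema: table -> tables it references via foreign keys.
schema = {
    "customers": set(),
    "products": set(),
    "orders": {"customers"},
    "order_items": {"orders", "products"},
}
tables = generate_tables(schema, rows_per_table=5)
```

Because a child table is only filled after all tables it references, every foreign key value is guaranteed to point at an existing row.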

Reduction

  • Large datasets reduced to representative subsets with measurable distribution fidelity
  • Three strategies for different requirements: distribution preservation, deduplication, or clustering
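
As an illustration of the distribution-preserving strategy, here is a minimal sketch of stratified sampling using only the Python standard library. The function name stratified_sample, the column name label, and the example data are assumptions made for this sketch, and the platform's actual implementation may differ.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(rows: list, key: str, fraction: float, seed: int = 0) -> list:
    """Sample the same fraction from every group defined by `key`."""
    rng = random.Random(seed)

    # Group the rows into strata by the value of the chosen column.
    strata = defaultdict(list)
    for row in rows:
        strata[row[key]].append(row)

    sample = []
    for group in strata.values():
        # Keep at least one row per stratum so rare classes do not vanish.
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample


# Hypothetical imbalanced dataset: 90 rows of class "A", 10 of class "B".
rows = [{"label": "A"} for _ in range(90)] + [{"label": "B"} for _ in range(10)]
subset = stratified_sample(rows, key="label", fraction=0.1)
counts = Counter(r["label"] for r in subset)  # 9 rows of "A", 1 row of "B"
```

The 90/10 class ratio of the input survives in the reduced subset, which is the kind of distribution fidelity the reduction use case aims to measure.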