Internship: Testing Data Migrations with Synthetic Data: An AI-Powered Approach
- 02 May 2026
- 100%
- Pully
Job summary
Join our team for an exciting internship in data migration! Work in a dynamic, collaborative environment while making a real impact.
Tasks
- Design a strategy for migration script testing and dataset generation.
- Implement a proof-of-concept system for realistic test datasets.
- Explore multi-agent architectures for efficient data generation.
Skills
- Strong Python skills, SQL knowledge, and familiarity with data security are essential.
- Experience with LLMs and agentic systems is a plus.
- Clear technical writing and problem-solving mindset are required.
About the job
Description
Data platform migrations are common in enterprise environments, moving from legacy systems to modern infrastructure while preserving business logic. The technical challenge isn't just syntax translation; it's validation. When developers migrate SQL scripts or data pipelines between platforms, they face different execution environments, modified data access permissions, and no safe way to test against production data.
This internship tackles synthetic data generation for migration script testing. You'll design and implement a system that generates realistic test datasets mirroring production structure and behavior without exposing sensitive information. There are different possible approaches: the result could be a small dataset living in a git repository, or a fully-fledged synthetic data warehouse. Either way, the data must be realistic enough to catch real bugs.
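As a loose illustration of the "small dataset living in a git repository" end of that spectrum, the sketch below generates rows from a hand-written column specification. All column names, value ranges, and distributions here are invented for the example; a real system would derive them from the production schema and observed data:

```python
import random
import string

# Hypothetical column specification: column name -> generator of plausible
# values. In practice these would be inferred from the production schema.
SCHEMA = {
    "customer_id": lambda: random.randint(1, 10_000),
    "country": lambda: random.choice(["CH", "FR", "DE", "IT"]),
    "balance": lambda: round(random.uniform(0, 5_000), 2),
    "email": lambda: "".join(random.choices(string.ascii_lowercase, k=8))
    + "@example.com",
}


def generate_rows(n: int) -> list[dict]:
    """Generate n synthetic rows matching the schema's shape."""
    return [{col: gen() for col, gen in SCHEMA.items()} for _ in range(n)]


rows = generate_rows(100)
print(len(rows), sorted(rows[0].keys()))
```

A dataset this small can be committed to git and loaded in CI, but catching real bugs usually requires richer distributions and cross-column constraints than this stub models.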
The challenge goes beyond simple data mocking. You'll need to decide whether to generate from real data (with anonymization risks), from query analysis alone (which requires good documentation), or through a hybrid approach. Should categorical values match production exactly, or can we substitute them and adapt the scripts? Can we extend unit testing to end-to-end testing, and what would be the required dataset properties?
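One way to substitute categorical values without breaking the scripts is deterministic pseudonymization, sketched below. Equal inputs map to equal tokens, so joins and group-bys behave as they would on production data; the salt and token format are purely illustrative, not a recommendation for a real secret-management scheme:

```python
import hashlib


def pseudonymize(value: str, salt: str = "demo-salt") -> str:
    # Deterministic substitution: the same input always yields the same
    # token, preserving referential integrity across tables while hiding
    # the original value. The salt here is a placeholder for the example.
    digest = hashlib.sha256((salt + value).encode()).hexdigest()[:8]
    return f"cat_{digest}"


# Joins and group-bys survive: equal inputs give equal tokens.
a = pseudonymize("ACME Corp")
b = pseudonymize("ACME Corp")
assert a == b and a != "ACME Corp"
print(a)
```

Note that deterministic schemes leak equality patterns, which is exactly the kind of anonymization trade-off the internship would need to weigh.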
Part of the work involves establishing an evaluation methodology—potentially collecting a reference set of migration scripts and their expected behaviors to measure how well different synthetic data approaches catch real issues. There's potential to explore multi-agent architectures where specialized agents handle different aspects: schema analysis, constraint extraction, data generation, anonymization verification, and test validation. This is applied research with immediate production impact.
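The evaluation idea above, a reference set of scripts with expected behaviors, could be prototyped as something like the toy harness below. The table, rows, and test cases are all invented for the example; the point is that a discrepancy between a query's result and its recorded expectation counts as a caught issue:

```python
import sqlite3


def evaluate(cases, rows):
    """Run each (sql, expected) case against a synthetic dataset and
    count the discrepancies the data manages to expose."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    caught = 0
    for sql, expected in cases:
        got = conn.execute(sql).fetchone()[0]
        if got != expected:
            caught += 1  # the synthetic data exposed a behavioral difference
    return caught


# NULLs are a classic source of migration bugs, so the synthetic rows
# deliberately include one.
rows = [(1, 10.0), (2, 20.0), (3, None)]
cases = [
    ("SELECT COUNT(*) FROM orders", 3),
    ("SELECT COUNT(amount) FROM orders", 3),  # wrong: COUNT(col) skips NULLs
]
print(evaluate(cases, rows))  # -> 1 discrepancy caught
```

A metric like "discrepancies caught per reference case" would then let different synthetic data strategies be compared on equal footing.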
Objectives
- Design a strategy for migration script testing that balances realism, anonymization, and practical constraints
- Implement a proof-of-concept system that generates test datasets from schema documentation, existing queries, or (carefully) sampled production data
- Define testing strategies: unit tests vs. end-to-end tests, minimum viable data sizes, etc.
- Develop an evaluation methodology to measure the effectiveness of different synthetic data generation approaches
- Explore multi-agent architectures for decomposing the generation pipeline into specialized components (schema analysis, constraint satisfaction, validation)
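The decomposition named in the last objective could start as plain functions before any LLM agents are involved; every stage below is an illustrative stub (toy DDL parsing, a trivial constraint, a one-line validator), not a proposed design:

```python
def schema_analysis(ddl: str) -> list[str]:
    # Toy "agent": extract column names from a simple CREATE TABLE statement.
    inner = ddl.split("(", 1)[1].rsplit(")", 1)[0]
    return [part.strip().split()[0] for part in inner.split(",")]


def constraint_satisfaction(columns: list[str]) -> dict[str, str]:
    # Toy "agent": attach a placeholder NOT NULL constraint to each column.
    return {col: "not_null" for col in columns}


def validation(constraints: dict[str, str], row: dict) -> bool:
    # Toy "agent": check a generated row against the extracted constraints.
    return all(row.get(col) is not None for col in constraints)


cols = schema_analysis("CREATE TABLE t (id INTEGER, name TEXT)")
cons = constraint_satisfaction(cols)
print(cols, validation(cons, {"id": 1, "name": "x"}))
```

Swapping any one stage for an LLM-backed agent without touching the others is the kind of modularity a multi-agent architecture would aim for.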
Our offer
- A dynamic, collaborative work environment with a highly motivated, multicultural team across international sites
- The chance to make a difference in people's lives by building innovative solutions
- Various internal coding events (hackathons, brownbags); see our technical blog
- Monthly after-works organized at each location
Skills required
- Strong Python programming: data processing, testing patterns, CI/CD integration
- Understanding of relational databases, SQL, and data modeling concepts
- Experience with LLMs and agentic systems: prompting, tool use, multi-agent orchestration
- Familiarity with data security and data anonymization concepts
- Problem-solving mindset: comfort with ambiguous requirements and making justified technical trade-offs
- Clear technical writing and documentation skills
About the company
Pully