ploosh.
Documentation
Regression testing
Ploosh can be used as a regression testing framework to ensure that changes to data pipelines do not break existing functionality.
Context
In data projects, every modification to the ETL chain introduces a risk of regression:
- New feature development
- Bug fixes
- Infrastructure changes
- Dependency updates
- Schema modifications
Strategy
Build a test suite incrementally
- Start with critical tests: Focus on the most important tables and business rules
- Add tests on bug discovery: When a bug is found, create a test case that would catch it
- Cover all layers: Test across the entire data chain (raw → warehouse → datamart)
Example test suite
# Regression test: employee count by department
Test employee count:
options:
sort:
- department
source:
connection: dwh
type: mssql
query: |
SELECT department, COUNT(*) AS count
FROM dwh.dim_employee
WHERE is_active = 1
GROUP BY department
expected:
connection: dmt
type: mssql
query: |
SELECT department, employee_count AS count
FROM dmt.department_summary
# Regression test: no orphan records
Test no orphan orders:
source:
connection: dwh
type: mssql
query: |
SELECT o.order_id
FROM dwh.fact_orders o
LEFT JOIN dwh.dimcustomer c ON o.customerid = c.customer_id
WHERE c.customer_id IS NULL
expected:
type: empty
CI/CD integration
Run regression tests automatically after every deployment:
- Deploy new code to the data platform
- Execute data pipelines
- Run Ploosh regression suite
- Publish results to Azure DevOps / GitHub
See Azure DevOps pipeline and GitHub Actions for integration guides.
Best practices
- Version control test cases: Store YAML files alongside pipeline code in Git
- Run on every deployment: Automate execution in CI/CD pipelines
- Collaborative maintenance: Tests can be written by developers, testers, and business analysts
- Use pass_rate for tolerance: Allow minor acceptable differences instead of strict matching
- Disable flaky tests: Use the
disabledoption to temporarily skip unstable tests while investigating