ploosh.
Documentation
Testing approaches
Ploosh is a flexible framework, but the intelligence of the tests lies in the people who write them. This guide presents three key approaches to structuring your data tests.
By recalculation
This approach validates the results produced by ETL processes by recalculating the same operations using independent SQL queries.
Principle
Write an SQL query that reproduces the calculations and transformations performed by the ETL on a given dataset. Then compare the results of this query to the ETL output. If both match, the test passes.
Example
Suppose an ETL aggregates sales data by category and region. To verify the results:
Test sales aggregation:
source:
connection: dwh
type: mssql
query: |
SELECT category, region, SUM(salesamount) AS totalsales
FROM dwh.fact_sales
GROUP BY category, region expected:
connection: dmt
type: mssql
query: |
SELECT category, region, total_sales
FROM dmt.factsalesaggregated
Use cases
- Verification of complex aggregations (monthly sales by product and region)
- Calculation of ratios or KPIs derived from multiple sources
- Validation of business rules at each stage of data processing
Trade-offs
The main effort is writing SQL queries that replicate complex transformations. However, it provides fine control and pinpoints exactly where errors occur.
By empty
This approach relies on the empty connector for the expected part. The source query should return no data. If it does, the test fails.
Principle
Define a source query whose criteria identify incorrect or undesirable data. If the query returns results, it means anomalies exist. If no data is returned, the test passes.
Example
Check for rejected records inserted in the last 24 hours:
Test no recent rejects:
source:
connection: dwh
type: mssql
query: |
SELECT *
FROM dwh.factsalesrejects
WHERE date_insert BETWEEN DATEADD(day, -1, GETDATE()) AND GETDATE()
expected:
type: empty
Use cases
- Data quality: Ensure invalid data is not present in target tables
- Duplicate detection: Verify no duplicate records exist
- Consistency checks: Ensure relationships and business rules are respected
- NULL checks: Verify no NULL values in mandatory columns
- Range validation: Ensure values fall within expected ranges
Advantages
- Simplicity: Only a source query is needed, no expected query
- Fast implementation: Rapid creation of tests for critical checks
- Flexibility: Adaptable to many data validation scenarios
By test data
This approach uses predefined test data inserted into source systems to simulate specific scenarios.
Principle
- Insert known test data into the data sources
- Run the ETL processes
- Compare the output against pre-calculated expected results
Example
Using a CSV file with pre-calculated expected results:
Test transformation rules:
options:
sort:
- employee_id
source:
connection: dmt
type: mysql
query: |
SELECT employeeid, fullname, department, is_active
FROM dmt.dim_employee
WHERE employee_id IN (9001, 9002, 9003)
expected:
type: csv
path: ./testdata/expectedemployees.csv
Use cases
- Validating complex business rules with specific edge cases
- Testing data type conversions and format transformations
- Verifying calculations with known inputs and outputs
Trade-offs
Requires functional understanding of the system and effort to maintain test datasets. However, it provides the most deterministic and reproducible tests.
Choosing an approach
| Approach | Best for | Complexity | Maintenance |
|---|---|---|---|
| Recalculation | Aggregations, KPIs, business rules | Medium | Medium |
| Empty | Quality checks, duplicates, anomalies | Low | Low |
| Test data | Edge cases, transformations, rules | High | High |