SciTransfer
ALIGNED · Project

Tools That Keep Your Web Data Clean and Your Software in Sync

digital · Tested · TRL 5

Imagine you run a business that pulls information from the web — prices, medical records, product data — and builds software around it. Every time the data changes, your software breaks. ALIGNED built tools that let your data and your software evolve together, automatically catching errors and keeping everything consistent. Think of it like spell-check, but for entire databases — it flags bad data before it causes expensive problems downstream.

By the numbers
EUR 3,999,934: EU funding for tool development
7: consortium partners
5: countries in consortium
19: total deliverables produced
4: working demo tools delivered
43%: industry partner ratio in consortium
3: industry partners including Wolters Kluwer and Semantic Web Company
The business problem

What needed solving

Companies that build applications on top of web data spend enormous effort keeping their software and data in sync. When data sources change format or quality drops, applications break — and finding the problem manually is slow and expensive. There is no standard, lightweight way to automatically test data quality and evolve software alongside the data it depends on.

The solution

What was built

The project delivered working prototypes including: a Semantic Booster that auto-generates code transformations from system specifications, a Model Catalogue Tool integrated with Eclipse IDE for managing data models, and an automated data testing and verification tool (based on RDFUnit) that generates and runs test cases for data consistency, completeness, and coverage — integrated into standard Maven build processes.
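To make the idea of specification-driven data testing concrete, here is a minimal Python sketch. It is not RDFUnit or any ALIGNED tool: the `SPEC` format, the record layout, and every function name are hypothetical, chosen only to illustrate how completeness and consistency tests can be generated from a specification and how a simple coverage figure can be computed.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Triple = Tuple[str, str, object]          # (subject, property, value)
Records = Dict[str, Dict[str, object]]    # subject -> {property: value}

# Hypothetical specification: required properties and value types per class.
SPEC = {"Product": {"name": str, "price": float}}

# Toy data set with one missing name (p3) and one badly typed price (p2).
DATA: List[Triple] = [
    ("p1", "type", "Product"), ("p1", "name", "Lamp"), ("p1", "price", 19.99),
    ("p2", "type", "Product"), ("p2", "name", "Desk"), ("p2", "price", "cheap"),
    ("p3", "type", "Product"), ("p3", "price", 5.0), ("p3", "colour", "red"),
]

@dataclass
class GeneratedTest:
    description: str
    check: Callable[[Records], List[str]]  # returns offending subjects

def index_by_subject(triples: List[Triple]) -> Records:
    records: Records = {}
    for subject, prop, value in triples:
        records.setdefault(subject, {})[prop] = value
    return records

def generate_tests(spec) -> List[GeneratedTest]:
    """Derive one completeness and one consistency test per specified property."""
    tests = []
    for cls, props in spec.items():
        for prop, expected in props.items():
            def missing(records, cls=cls, prop=prop):
                return [s for s, r in records.items()
                        if r.get("type") == cls and prop not in r]
            def wrong_type(records, cls=cls, prop=prop, expected=expected):
                return [s for s, r in records.items()
                        if r.get("type") == cls and prop in r
                        and not isinstance(r[prop], expected)]
            tests.append(GeneratedTest(f"{cls}.{prop} is present", missing))
            tests.append(GeneratedTest(f"{cls}.{prop} is a {expected.__name__}", wrong_type))
    return tests

def run_tests(tests: List[GeneratedTest], triples: List[Triple]) -> Dict[str, List[str]]:
    """Run every generated test and keep only those with offending subjects."""
    records = index_by_subject(triples)
    results = {t.description: t.check(records) for t in tests}
    return {desc: bad for desc, bad in results.items() if bad}

def coverage(spec, triples: List[Triple]) -> float:
    """Share of properties used in the data that the specification describes."""
    used = {p for _, p, _ in triples if p != "type"}
    specified = {p for props in spec.values() for p in props}
    return len(used & specified) / len(used) if used else 1.0
```

Calling `run_tests(generate_tests(SPEC), DATA)` on the toy data reports `p3` for the missing name and `p2` for the non-numeric price, while `coverage(SPEC, DATA)` shows that one property in the data (`colour`) is not described by the specification at all.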

Audience

Who needs this

Legal and financial information publishers managing large structured datasets
Healthcare IT companies integrating patient data from multiple sources
Enterprise knowledge management platforms using Linked Data
Data quality teams at companies building applications on web-scraped data
Open data publishers maintaining public datasets
Business applications

Who can put this to work

Legal & Financial Information Services
enterprise
Target: Publishers managing large regulatory or financial datasets

If you are a legal information publisher dealing with constantly changing regulations across multiple jurisdictions — this project developed an automated data testing tool that checks your datasets for consistency, completeness, and coverage before they reach your customers. Wolters Kluwer, one of the 3 industry partners, used these tools in their production Linked Data systems.

Healthcare IT
enterprise
Target: Companies building clinical data systems for hospitals or national health services

If you are a healthcare IT provider struggling to keep patient data systems updated as medical standards and data formats evolve, this project built a Semantic Booster that auto-generates software transformations from system specifications. The University of Oxford used these methods to transform NHS systems, which suggests the approach can work at national scale.

Data Management & Analytics
mid-size
Target: Companies that extract, clean, and structure web data for business use

If you are a data company spending too much time manually checking whether scraped or integrated web data is accurate and complete — this project delivered an automated verification tool integrated with Maven build processes. It generates test cases automatically from your data specifications, catching quality issues before they propagate through your pipeline.

Frequently asked

Quick answers

What would it cost to implement these tools?

The tools were developed as open-source prototypes with a total EU investment of EUR 3,999,934 across 7 partners. Integration costs would depend on your existing tech stack — the data testing tool plugs into Maven build processes, and the Model Catalogue integrates with Eclipse IDE, so teams already using these will have lower adoption costs.

Can these tools handle enterprise-scale data?

The tools were validated with real-world partners handling large datasets: Wolters Kluwer processes legal information at scale, and Oxford applied the methods to NHS systems. The automated test generation approach is designed to scale — it generates test cases from specifications rather than requiring manual test writing.

What is the IP and licensing situation?

The project committed to open-source tool releases and engaged standards bodies including OMG, W3C, and ISO for technology transfer. Based on available project data, the core tools (Semantic Booster, Model Catalogue, RDFUnit-based testing) were developed as open-source, meaning licensing costs should be minimal or zero.

How mature are these tools — can I use them today?

All 4 demo deliverables reached Phase 3 (final prototype stage) by project end in January 2018. They are functional prototypes tested with real partners, not just lab experiments. However, as the project closed in 2018, you should verify current maintenance status and community activity.

How do these integrate with our existing development workflow?

The tools were specifically designed for integration: the Model Catalogue works within Eclipse IDE, and the data testing tool integrates with Maven build processes. This means they slot into standard Java/enterprise development workflows rather than requiring a separate toolchain.
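The value of build integration is that a data quality failure stops the build in the same way a failing unit test does. The Python sketch below illustrates that gating logic only; it is not the project's Maven integration, and the `build_gate` function, its `max_violations` parameter, and the shape of the `failures` mapping are assumptions made for illustration.

```python
import sys
from typing import Dict, List

def build_gate(failures: Dict[str, List[str]], max_violations: int = 0) -> int:
    """Return a process exit code: 0 lets the build continue, 1 fails it.

    `failures` maps a test description to the identifiers of offending records.
    """
    total = sum(len(ids) for ids in failures.values())
    for description, ids in sorted(failures.items()):
        # Report each failed data test on stderr, as a build tool would log it.
        print(f"FAIL {description}: {', '.join(ids)}", file=sys.stderr)
    return 0 if total <= max_violations else 1
```

A build script could pass the returned value to `sys.exit`, so that any violation above the tolerated threshold marks the build as failed.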

Is there ongoing support or a community behind these tools?

The project included 19 total deliverables covering scientific publications, training programs, and standards engagement with W3C, OMG, and ISO. The consortium of 7 partners across 5 countries built a research community, but ongoing commercial support would likely need to come through the industry partners or a service agreement.

Consortium

Who built it

The 7-partner consortium spans 5 countries (Austria, Germany, Ireland, Poland, UK) and has a strong 43% industry ratio, with 3 industry partners alongside 4 universities. This is a well-balanced team: Trinity College Dublin coordinated, with world-class research from Oxford and Leipzig (co-creators of DBpedia). On the business side, Wolters Kluwer brings real production data systems, and Semantic Web Company is a recognized leader in enterprise Linked Data. The one SME in the group adds agility. For a potential business buyer, the presence of Wolters Kluwer as a validation partner is the strongest signal, since a publisher of that size is unlikely to participate unless the tools address real production needs.

How to reach the team

Trinity College Dublin coordinated this project. The research team can be reached through the university's Computer Science department.

Next steps

Talk to the team behind this work.

Want to know if ALIGNED's automated data quality tools fit your data pipeline? We can arrange a technical briefing with the research team and assess integration feasibility for your specific setup.