Recent Pipelines

Legal‑domain corpus

Curated 2.1B‑token dataset drawn from UK legal sources, built for a London‑based legal AI startup. Includes synthetic summarisation labels.

[legal_v1.4.jol]

Synthetic retail imagery

500k annotated product images with bounding boxes and attributes, for computer‑vision shelf analysis.

[retail_synth_v2.jol]

News syndication feed

Real‑time structured news feed with sentiment labels and entity extraction, powering a financial sentiment model.

[news_feed_live.jol]

Medical imaging metadata

Synthetic labels and segmentation masks for 200k chest X‑rays, enabling rare pathology detection.

[med_img_v1.jol]

Code documentation pairs

Curated dataset of 10M code snippets with synthetic natural language descriptions for code‑LLM training.

[code_synth.jol]

Multilingual web crawl

Cleaned and deduplicated web crawl data across 12 languages with quality scoring and topic classification.

[web_crawl_v3.jol]