2.1B token curated dataset from UK legal sources for a London‑based legal AI startup. Included synthetic summarisation labels.
[legal_v1.4.jol]
500k annotated product images with bounding boxes and attributes for computer vision shelf analysis.
[retail_synth_v2.jol]
Real‑time structured news feed with sentiment labels and entity extraction, powering a financial sentiment model.
[news_feed_live.jol]
Synthetic labels and segmentation masks for 200k chest X‑rays, enabling rare pathology detection.
[med_img_v1.jol]
Curated dataset of 10M code snippets with synthetic natural language descriptions for code‑LLM training.
[code_synth.jol]
Cleaned and deduplicated web crawl data across 12 languages with quality scoring and topic classification.
[web_crawl_v3.jol]