03_data/

Corpus structure, licensing posture, provenance, and counts for the ~2.5M-sound training corpus.

What goes here: how the data is organised, where it came from, its licensing status (licensed or CC0), manifest versions, and dataset-level statistics. Frontmatter type: data.

What does NOT go here: the audio itself, raw exports, or anything with licensing restrictions that shouldn't be version-controlled. Keep large/raw artifacts in their own storage and reference them here.

Owner: Daniil (@dsultanov), with Arseniy on taxonomy-linked items.


How to add to this folder

  1. Branch (bot/<slug> for agents, feat/<slug> for humans).
  2. Copy _TEMPLATE.md if present; otherwise start with frontmatter (see ../CONTRIBUTING.md §2).
  3. Search first — don't duplicate an existing file.
  4. Run make verify, commit, and open a PR. Never push to main.
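
The steps above can be sketched as a small shell helper. This is a minimal sketch, not the project's official tooling. The function names, the example slug, the commit message format, the origin remote name, and the placement of _TEMPLATE.md inside 03_data/ are all assumptions made for illustration. It also assumes make verify is run from the repository root. Only the type: data frontmatter field is taken from this README; any other required fields are defined in ../CONTRIBUTING.md §2.

```shell
#!/usr/bin/env bash
# Sketch only: names, paths, and the commit message format are hypothetical.

# Build the branch name: bot/<slug> for agents, feat/<slug> for humans.
branch_name() {
  local author_kind="$1" slug="$2"
  if [ "$author_kind" = "agent" ]; then
    printf 'bot/%s\n' "$slug"
  else
    printf 'feat/%s\n' "$slug"
  fi
}

# Walk through the four steps. Assumed to run from the repository root.
add_data_doc() {
  local author_kind="$1" slug="$2" new_file="03_data/$3"

  # Step 1: create the branch.
  git switch -c "$(branch_name "$author_kind" "$slug")" || return 1

  # Step 3 (done early): list existing files that already mention the topic.
  grep -ril -- "$slug" 03_data/ || true

  # Step 2: copy the template if present, otherwise start from frontmatter.
  if [ -f 03_data/_TEMPLATE.md ]; then
    cp 03_data/_TEMPLATE.md "$new_file"
  else
    # Only `type: data` is known from this README; add the remaining
    # fields required by ../CONTRIBUTING.md §2 by hand.
    printf -- '---\ntype: data\n---\n' > "$new_file"
  fi

  # Step 4: verify, commit, and push the branch (never main), then open a PR.
  make verify || return 1
  git add "$new_file" && git commit -m "docs(data): add ${slug}" || return 1
  git push -u origin HEAD
}
```

A human contributor might call it as add_data_doc human corpus-stats corpus-stats.md, review the grep output for duplicates, and then open the PR from the pushed branch.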