03_data/
Corpus structure, licensing posture, provenance, and counts for the ~2.5M-sound training corpus.
What goes here: how the data is organised, where it came from, its licensing
(licensed / CC0), manifest versions, and dataset-level statistics. Frontmatter `type: data`.
What does NOT go here: the audio itself, raw exports, or anything with licensing restrictions that shouldn't be version-controlled. Keep large/raw artifacts in their own storage and reference them here.
Owner: Daniil (@dsultanov), with Arseniy on taxonomy-linked items.
How to add to this folder
- Branch (`bot/<slug>` for agents, `feat/<slug>` for humans).
- Copy `_TEMPLATE.md` if present; otherwise start with frontmatter (see `../CONTRIBUTING.md` §2).
- Search first: don't duplicate an existing file.
- `make verify`, commit, open a PR. Never push to `main`.
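
The branch-naming and search-first steps above can be sketched as a small Python helper. This is illustrative only and not part of the repo tooling: the function names `branch_name` and `find_similar` are hypothetical, and the slug pattern (lowercase words joined by hyphens) is an assumption, since this README does not define what a valid slug looks like.

```python
import re
from pathlib import Path

# Assumption: slugs are lowercase alphanumerics joined by single hyphens.
# The README does not specify slug rules; adjust to match CONTRIBUTING.md.
SLUG_PATTERN = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def branch_name(slug: str, is_agent: bool) -> str:
    """Return 'bot/<slug>' for agents and 'feat/<slug>' for humans."""
    if not SLUG_PATTERN.fullmatch(slug):
        raise ValueError(f"unexpected slug: {slug!r}")
    prefix = "bot" if is_agent else "feat"
    return f"{prefix}/{slug}"


def find_similar(folder: Path, keyword: str) -> list[str]:
    """List Markdown files in `folder` whose name contains `keyword`.

    A rough aid for the "search first" step; it only matches file names,
    so a real check should also look at file contents.
    """
    needle = keyword.lower()
    return sorted(p.name for p in folder.glob("*.md") if needle in p.name.lower())
```

For example, `branch_name("manifest-notes", is_agent=False)` gives `feat/manifest-notes`, and `find_similar(Path("03_data"), "manifest")` lists existing files worth reading before adding a new one.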