Open research infrastructure for reproducible data preparation.
The Problem
Across experimental research, data preparation is one of the most time-consuming stages, yet it remains largely undocumented. Stimulus lists and datasets are built manually in spreadsheets, with no version control, no audit trails, and no standardized way to reproduce or share the process. Published studies rarely document how items were filtered, balanced, or allocated. This reproducibility gap undermines confidence in results and makes replication unnecessarily difficult.
The LexPrep Workflow
Enrich
Automated Linguistic Enrichment
- G2P phonetic transcription
- Syllable counting
- POS tagging (UPOS + fine-grained)
- Word length (Unicode codepoints)
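As an illustration of the word-length measure, here is a minimal sketch that counts Unicode code points in Python. The function name `codepoint_length` and the optional NFC normalization step are assumptions made for this example, not LexPrep's documented behavior. In Python, `len()` on a `str` already counts code points, so the only design choice is whether to normalize first.

```python
import unicodedata
from typing import Optional


def codepoint_length(word: str, normalize: Optional[str] = "NFC") -> int:
    """Count Unicode code points in `word`.

    `normalize` is a hypothetical option: pass a normalization form such as
    "NFC" to compose characters first, or None to count the raw input.
    """
    if normalize is not None:
        word = unicodedata.normalize(normalize, word)
    return len(word)


print(codepoint_length("کتاب"))                    # 4
print(codepoint_length("e\u0301"))                 # 1 (composed to a single code point)
print(codepoint_length("e\u0301", normalize=None)) # 2 (base letter + combining accent)
```

Note that invisible characters such as the zero-width non-joiner (U+200C), which is common in Persian spelling, count as one code point each under this measure.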
Sample
Bias-Aware Stratified Sampling
- Quantile-based stratification
- Equal, proportional, optimal (Neyman) and fixed allocation methods
- Seeded randomization for exact reproduction across runs
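The sampling steps above can be sketched with the Python standard library alone. This is an illustrative sketch, not LexPrep's implementation: the function names (`quantile_bins`, `allocate`, `stratified_sample`), the largest-remainder rounding rule, and the use of population standard deviation for the Neyman weights are all assumptions. Fixed allocation is omitted because it amounts to passing the per-bin counts directly.

```python
import random
import statistics
from bisect import bisect_right


def quantile_bins(values, n_bins):
    """Return a bin index (0..n_bins-1) per value, using quantile cut points."""
    cuts = statistics.quantiles(values, n=n_bins, method="inclusive")
    return [bisect_right(cuts, v) for v in values]


def allocate(sizes, sds, total, method="proportional"):
    """Split `total` across strata; rounding uses the largest-remainder rule."""
    if method == "equal":
        weights = [1.0] * len(sizes)
    elif method == "proportional":
        weights = [float(n) for n in sizes]
    elif method == "neyman":
        # Neyman (optimal) allocation: n_h proportional to N_h * S_h
        weights = [n * s for n, s in zip(sizes, sds)]
    else:
        raise ValueError(f"unknown method: {method}")
    weight_sum = sum(weights)
    if weight_sum == 0:
        raise ValueError("all allocation weights are zero")
    raw = [total * w / weight_sum for w in weights]
    counts = [int(r) for r in raw]
    # Hand out the units lost to truncation, largest fractional part first.
    order = sorted(range(len(raw)), key=lambda i: raw[i] - counts[i], reverse=True)
    for i in order[: total - sum(counts)]:
        counts[i] += 1
    return counts


def stratified_sample(items, values, n_bins, total, seed, method="equal"):
    """Draw a seeded stratified sample; returns {bin_index: [items]}."""
    strata = {b: [] for b in range(n_bins)}
    for item, value, b in zip(items, values, quantile_bins(values, n_bins)):
        strata[b].append((item, value))
    sizes = [len(strata[b]) for b in range(n_bins)]
    sds = [
        statistics.pstdev([v for _, v in strata[b]]) if sizes[b] > 1 else 0.0
        for b in range(n_bins)
    ]
    counts = allocate(sizes, sds, total, method)
    rng = random.Random(seed)  # local generator: the draw depends only on the seed
    sample = {}
    for b in range(n_bins):
        if counts[b] > sizes[b]:
            raise ValueError(f"bin {b} has {sizes[b]} items, needs {counts[b]}")
        sample[b] = [item for item, _ in rng.sample(strata[b], counts[b])]
    return sample


words = [f"word{i}" for i in range(40)]
freqs = list(range(1, 41))  # made-up frequency values for illustration
first = stratified_sample(words, freqs, n_bins=4, total=8, seed=2024)
again = stratified_sample(words, freqs, n_bins=4, total=8, seed=2024)
print(first == again)  # True: same seed and same input order give the same draw
```

Using a dedicated `random.Random(seed)` instance rather than the module-level generator keeps the draw independent of any other random calls in the same process. Reproduction also depends on the input order being identical across runs.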
Audit
Machine-Readable Audit Trails
- ZIP reproducibility pack per run
- JSON manifest: tool, version, seed, timestamp, library versions
- Per-bin statistics and excluded items
- Full documentation of every decision
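To make the audit trail concrete, the following sketch writes a reproducibility pack as a ZIP archive containing a JSON manifest. The manifest fields mirror the list above (tool, version, seed, timestamp, library versions). The file names inside the archive, the function names, and the placeholder version string are hypothetical, since the actual pack layout is not specified here.

```python
import json
import zipfile
from datetime import datetime, timezone
from importlib import metadata


def library_versions(names):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def build_manifest(seed, libraries, tool="lexprep", version="0.0.0"):
    """Assemble the manifest; `tool` and `version` defaults are placeholders."""
    return {
        "tool": tool,
        "version": version,
        "seed": seed,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "library_versions": library_versions(libraries),
    }


def write_pack(target, manifest, bin_stats, excluded):
    """Write manifest, per-bin statistics and excluded items into one ZIP.

    `target` may be a file path or a writable binary file object.
    The three file names below are assumptions for this sketch.
    """
    with zipfile.ZipFile(target, "w", compression=zipfile.ZIP_DEFLATED) as pack:
        pack.writestr("manifest.json", json.dumps(manifest, indent=2, ensure_ascii=False))
        pack.writestr("bin_stats.json", json.dumps(bin_stats, indent=2, ensure_ascii=False))
        pack.writestr("excluded.json", json.dumps(excluded, indent=2, ensure_ascii=False))
```

A run would then call something like `write_pack("run_pack.zip", build_manifest(2024, ["stanza", "spacy"]), bin_stats, excluded)`. Recording the seed together with library versions is what lets a later reader rerun the same draw in a matching environment.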
Supported Languages & Tools
| Language | G2P | Syllables | POS | Length |
|---|---|---|---|---|
| Persian | PersianG2p | Heuristic | Stanza | |
| English | g2p-en (CMU) | pyphen (TeX) | spaCy | |
| Japanese | - | - | Stanza/UniDic | |
| Any | - | - | - | |
Example Use Case
A researcher is preparing 200 Persian words for a reading experiment. With LexPrep:
Total time: minutes, not hours.
Fully documented.
Roadmap
LexPrep aims to become the standard infrastructure for reproducible data preparation in experimental research.