v1.0.0 is live
LexPrep

Open research infrastructure for reproducible data preparation.

MIT License | Python 3.10+

The Problem

Across experimental research, data preparation is one of the most time-consuming stages, yet it remains largely undocumented. Stimulus lists and datasets are built manually in spreadsheets, with no version control, no audit trails, and no standardized way to reproduce or share the process. Published studies rarely document how items were filtered, balanced, or allocated. This reproducibility gap undermines confidence in results and makes replication unnecessarily difficult.

The LexPrep Workflow

1. Enrich

Automated Linguistic Enrichment

  • G2P phonetic transcription
  • Syllable counting
  • POS tagging (UPOS + fine-grained)
  • Word length (Unicode codepoints)

Supports Persian, English, and Japanese.
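
As an illustration of the "Word length (Unicode codepoints)" feature, here is a minimal sketch using only the Python standard library. The function name `word_length` and the NFC normalization step are assumptions made for this example. The source does not say whether LexPrep normalizes text before counting.

```python
import unicodedata


def word_length(word: str) -> int:
    """Count Unicode code points after NFC normalization.

    Hypothetical helper: NFC is an assumption of this sketch, chosen so that
    a base letter plus a combining mark counts the same as the precomposed form.
    """
    return len(unicodedata.normalize("NFC", word))


# Persian "ketab" (book): four letters, four code points.
ketab = "\u06a9\u062a\u0627\u0628"
# "cafe" followed by a combining acute accent; NFC composes the last two.
cafe_decomposed = "cafe\u0301"
# Persian word containing a zero-width non-joiner (U+200C), which is
# itself a code point and is therefore counted.
miravam = "\u0645\u06cc\u200c\u0631\u0648\u0645"

lengths = [word_length(w) for w in (ketab, cafe_decomposed, miravam)]
```

Counting code points rather than bytes keeps lengths comparable across scripts, but invisible characters such as the zero-width non-joiner still contribute to the count unless they are stripped first.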

2. Sample

Bias-Aware Stratified Sampling

  • Quantile-based stratification
  • Equal, proportional, optimal (Neyman) and fixed allocation methods
  • Seeded randomization for exact reproduction across runs
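
The sampling step listed above can be sketched with the standard library alone. This is not LexPrep's implementation: the function names, the rounding rule, and the use of population standard deviation for Neyman allocation are all assumptions of this example, and fixed allocation is left out because the source does not define it.

```python
import random
import statistics
from bisect import bisect_right


def quantile_bins(values, n_bins):
    """Map each value to a bin index in 0..n_bins-1 via quantile cut points."""
    cuts = statistics.quantiles(values, n=n_bins)  # n_bins - 1 cut points
    return [bisect_right(cuts, v) for v in values]


def allocate(strata_values, total, method="equal"):
    """Return {bin: sample size}; sizes are rounded and capped at bin size,
    so they may not sum exactly to `total`."""
    if method == "equal":
        weights = {b: 1.0 for b in strata_values}
    elif method == "proportional":
        weights = {b: float(len(v)) for b, v in strata_values.items()}
    elif method == "neyman":
        # Neyman allocation: n_h proportional to N_h * S_h
        weights = {b: len(v) * statistics.pstdev(v)
                   for b, v in strata_values.items()}
    else:
        raise ValueError(f"unknown allocation method: {method}")
    weight_sum = sum(weights.values())
    if weight_sum == 0:  # e.g. every stratum has zero variance
        weights = {b: 1.0 for b in strata_values}
        weight_sum = float(len(weights))
    return {b: min(len(strata_values[b]), round(total * w / weight_sum))
            for b, w in weights.items()}


def stratified_sample(items, key, n_bins, total, method="equal", seed=0):
    """Draw a seeded stratified sample; `key` extracts the stratifying value."""
    values = [key(item) for item in items]
    strata = {}
    for item, b in zip(items, quantile_bins(values, n_bins)):
        strata.setdefault(b, []).append(item)
    sizes = allocate({b: [key(x) for x in s] for b, s in strata.items()},
                     total, method)
    rng = random.Random(seed)  # private generator: same seed, same draw
    sample = []
    for b in sorted(strata):  # fixed bin order keeps the draw deterministic
        sample.extend(rng.sample(strata[b], sizes[b]))
    return sample


# Hypothetical data: 100 words with a made-up frequency value each.
words = [(f"w{i}", i) for i in range(100)]
picked = stratified_sample(words, key=lambda w: w[1], n_bins=4,
                           total=20, method="equal", seed=42)
```

Using a private `random.Random(seed)` instance instead of the module-level generator means other code that touches the global random state cannot change the draw, which is the property a reproducibility pack depends on.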

3. Audit

Machine-Readable Audit Trails

  • ZIP reproducibility pack per run
  • JSON manifest: tool, version, seed, timestamp, library versions
  • Per-bin statistics and excluded items
  • Full documentation of every decision
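
To make the idea of a reproducibility pack concrete, here is a hedged sketch that writes a JSON manifest and supporting files into a ZIP archive. The file names (`manifest.json`, `bin_stats.json`, `excluded.json`), the manifest keys, and the function names are assumptions made for illustration. The source only states which kinds of information the manifest records.

```python
import io
import json
import zipfile
from datetime import datetime, timezone
from importlib import metadata


def library_versions(names):
    """Look up installed versions; None marks a package that is not installed."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def build_pack(seed, bin_stats, excluded, libraries=(), tool_version="1.0.0"):
    """Return the bytes of a ZIP pack (hypothetical layout, not LexPrep's)."""
    manifest = {
        "tool": "LexPrep",
        "version": tool_version,
        "seed": seed,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "library_versions": library_versions(libraries),
    }
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as pack:
        # ensure_ascii=False keeps Persian or Japanese items human-readable.
        pack.writestr("manifest.json",
                      json.dumps(manifest, indent=2, ensure_ascii=False))
        pack.writestr("bin_stats.json",
                      json.dumps(bin_stats, indent=2, ensure_ascii=False))
        pack.writestr("excluded.json",
                      json.dumps(excluded, indent=2, ensure_ascii=False))
    return buffer.getvalue()


# Example with made-up statistics and one excluded item.
pack_bytes = build_pack(
    seed=42,
    bin_stats={"0": {"n": 5, "mean_frequency": 12.0}},
    excluded=[{"item": "example", "reason": "missing frequency"}],
    libraries=("stanza", "spacy"),
)
```

Recording the seed together with library versions matters because a seeded draw is only repeatable when the surrounding software is the same. The manifest lets a replicator check both.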

Supported Languages & Tools

Language   G2P            Syllables      POS             Length
Persian    PersianG2p     Heuristic      Stanza
English    g2p-en (CMU)   pyphen (TeX)   spaCy
Japanese   -              -              Stanza/UniDic
Any        -              -              -

Example Use Case

A researcher is preparing 200 Persian words for a reading experiment. With LexPrep, the process takes four steps:

1. Upload raw wordlist (XLSX/CSV/TXT)
2. Enrich: add G2P, syllables, POS in one step
3. Sample: stratified selection across frequency bins
4. Export: ZIP pack with manifest for replication

Total time: minutes, not hours.

Fully documented.

Roadmap

LexPrep aims to become the standard infrastructure for reproducible data preparation in experimental research.

  • Linguistic enrichment (G2P, syllables, POS)
  • Stratified sampling & multi-file shuffle
  • ZIP reproducibility packs with JSON manifest
  • Statistical validation & distribution testing
  • Frequency integration from external corpora
  • Additional language support
  • Lab-level collaboration features