Overview - lexprep

v1.0.0 is live

Open research infrastructure for reproducible data preparation.

lexprep.net • GitHub • DOI: 10.5281/zenodo.18713755

MIT License • Python 3.10+

The Problem

Across experimental research, data preparation is one of the most time-consuming stages, yet it remains largely undocumented. Stimulus lists and datasets are built manually in spreadsheets, with no version control, no audit trails, and no standardized way to reproduce or share the process. Published studies rarely document how items were filtered, balanced, or allocated. This reproducibility gap undermines confidence in results and makes replication unnecessarily difficult.

The LexPrep Workflow

Enrich

Automated Linguistic Enrichment

G2P phonetic transcription
Syllable counting
POS tagging (UPOS + fine-grained)
Word length (Unicode codepoints)

Supports Persian, English, Japanese
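
As a minimal sketch of what "word length (Unicode codepoints)" means in practice: in Python, len() on a str already counts codepoints rather than bytes, so Persian and Japanese words are measured consistently. The function name codepoint_length and the optional NFC normalization step are assumptions for illustration, not LexPrep's documented API or behavior.

```python
import unicodedata


def codepoint_length(word: str, normalize: bool = False) -> int:
    """Count Unicode codepoints in a word.

    normalize=True applies NFC first, so a base letter plus a combining
    mark counts as one codepoint when a composed form exists. Whether
    LexPrep normalizes is not stated in this overview; this is an assumption.
    """
    if normalize:
        word = unicodedata.normalize("NFC", word)
    return len(word)


print(codepoint_length("کتاب"))                        # 4
print(codepoint_length("cafe\u0301"))                  # 5 (e + combining accent)
print(codepoint_length("cafe\u0301", normalize=True))  # 4
```

Counting codepoints instead of bytes keeps the measure independent of the file encoding, which matters for non-Latin scripts.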

Sample

Bias-Aware Stratified Sampling

Quantile-based stratification
Equal, proportional, optimal (Neyman) and fixed allocation methods
Seeded randomization for exact reproduction across runs
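
The sampling step listed above can be sketched with the standard library alone. This is an illustration under stated assumptions, not LexPrep's implementation: the function names (quantile_bins, allocate, stratified_sample) are hypothetical, rounding uses a largest-remainder rule chosen here for simplicity, and the "fixed" method is left out because this overview does not define it. Neyman allocation assigns each stratum a share proportional to its size times its standard deviation.

```python
import random
import statistics
from bisect import bisect_right


def quantile_bins(scores, n_bins):
    """Assign each score a bin index 0..n_bins-1 using quantile cut points."""
    cuts = statistics.quantiles(scores, n=n_bins)  # n_bins - 1 cut points
    return [bisect_right(cuts, s) for s in scores]


def allocate(strata, total, method="proportional"):
    """Return {bin: sample size}; strata maps bin -> list of scores."""
    if method == "equal":
        weights = {b: 1.0 for b in strata}
    elif method == "proportional":
        weights = {b: float(len(v)) for b, v in strata.items()}
    elif method == "neyman":
        # Neyman (optimal) allocation: n_h proportional to N_h * S_h
        weights = {
            b: len(v) * (statistics.stdev(v) if len(v) > 1 else 0.0)
            for b, v in strata.items()
        }
    else:
        raise ValueError(f"unknown method: {method}")
    scale = sum(weights.values())
    if scale == 0:  # no variance anywhere: fall back to proportional
        weights = {b: float(len(v)) for b, v in strata.items()}
        scale = sum(weights.values())
    raw = {b: total * w / scale for b, w in weights.items()}
    sizes = {b: int(r) for b, r in raw.items()}
    # Hand out the units lost to rounding, largest remainder first.
    leftover = total - sum(sizes.values())
    by_remainder = sorted(raw, key=lambda b: raw[b] - sizes[b], reverse=True)
    for b in by_remainder[:leftover]:
        sizes[b] += 1
    return sizes


def stratified_sample(items, scores, n_bins, total, method="proportional", seed=0):
    """Draw a seeded stratified sample of items, stratified on scores."""
    item_strata, score_strata = {}, {}
    for item, score, b in zip(items, scores, quantile_bins(scores, n_bins)):
        item_strata.setdefault(b, []).append(item)
        score_strata.setdefault(b, []).append(score)
    sizes = allocate(score_strata, total, method)
    rng = random.Random(seed)  # local generator: same seed, same sample
    sample = []
    for b in sorted(item_strata):
        k = min(sizes[b], len(item_strata[b]))  # never ask for more than a bin holds
        sample.extend(rng.sample(item_strata[b], k))
    return sample


words = [f"w{i}" for i in range(100)]
freqs = list(range(100))  # stand-in for a frequency measure
print(stratified_sample(words, freqs, n_bins=4, total=20, seed=42))
```

Using a dedicated random.Random(seed) instance rather than the module-level generator keeps the draw reproducible even when other code in the same process also uses random numbers.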

Audit

Machine-Readable Audit Trails

ZIP reproducibility pack per run
JSON manifest: tool, version, seed, timestamp, library versions
Per-bin statistics and excluded items
Full documentation of every decision
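
To make the audit trail concrete, here is a rough sketch of how a per-run manifest and ZIP pack could be assembled. The key names, the file names manifest.json and sample.txt, and the functions build_manifest and write_pack are assumptions for illustration; this overview only lists which fields the manifest records, not the actual schema.

```python
import io
import json
import zipfile
from datetime import datetime, timezone


def build_manifest(tool, version, seed, libraries, per_bin, excluded):
    """Collect the run metadata named above into one JSON-serializable dict."""
    return {
        "tool": tool,
        "version": version,
        "seed": seed,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "libraries": libraries,  # e.g. {"package-name": "installed version"}
        "per_bin": per_bin,      # per-bin statistics
        "excluded": excluded,    # items dropped during sampling
    }


def write_pack(manifest, sample_rows):
    """Return the bytes of a ZIP holding the manifest and the sampled items."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as pack:
        pack.writestr(
            "manifest.json",
            json.dumps(manifest, ensure_ascii=False, indent=2),
        )
        pack.writestr("sample.txt", "\n".join(sample_rows))
    return buffer.getvalue()


manifest = build_manifest(
    tool="lexprep",
    version="1.0.0",
    seed=42,
    libraries={"example-lib": "0.0.0"},  # placeholder, not a real version
    per_bin={"0": {"n": 5}},
    excluded=["w99"],
)
pack_bytes = write_pack(manifest, ["w1", "w2"])
print(len(pack_bytes), "bytes")
```

Passing ensure_ascii=False keeps Persian or Japanese items readable in the manifest instead of escaping them.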

Supported Languages & Tools

Language	G2P	Syllables	POS
Persian	PersianG2p	Heuristic	Stanza
English	g2p-en (CMU)	pyphen (TeX)	spaCy
Japanese	-	-	Stanza/UniDic
Any	-	-	-

Example Use Case

A researcher is preparing 200 Persian words for a reading experiment. With LexPrep, the process is:

Upload raw wordlist (XLSX/CSV/TXT)

Enrich: add G2P, syllables, POS in one step

Sample: stratified selection across frequency bins

Export: ZIP pack with manifest for replication

Total time: minutes, not hours.

Fully documented.

Roadmap

LexPrep aims to become the standard infrastructure for reproducible data preparation in experimental research.

Linguistic enrichment (G2P, syllables, POS)

Stratified sampling & multi-file shuffle

ZIP reproducibility packs with JSON manifest

Statistical validation & distribution testing

Frequency integration from external corpora

Additional language support

Lab-level collaboration features