Last updated 30 May 2026
For any queries, please contact hi@nekear.me.
The research is conducted as part of a Master's in AI thesis at the University of Galway, Ireland.
It involves training an AI model that predicts how hard a specific word is for a learner, given the morphological and distributional features of lexical items (e.g., McDonald co-occurrence probability, COCA rank & range, etc.) and user-specific information such as their review performance on similar items, first language relatedness, current proficiency, etc.
Both the dataset collected within this survey and the findings of this research will be made public (privacy protections described below apply) to foster further studies in this domain.
There's prior work on word-difficulty modeling: Duolingo has published a couple of important datasets in this area (HLR in 2016, SLAM in 2018), but both capture learning within Duolingo's own curriculum: platform-chosen words, platform-formatted exercises, platform scheduling. What is missing publicly is data on what learners themselves chose to study, in any language, scheduled by a memory-faithful algorithm like FSRS, with the full card content intact. Existing review-log datasets such as those from open-spaced-repetition (which FSRS was built on) strip the card content out for privacy, while other public vocabulary research datasets don't include memory data. Each side exists publicly on its own, but no public dataset combines the two.
This survey is building the first dataset that has both. Once released publicly, it removes a real bottleneck for anyone working on personalized vocabulary learning.
Beyond the dataset, the research contributes a model that predicts word difficulty by combining two things usually studied separately: the linguistic properties of a word (its morphology and how it's distributed across real usage) and an individual's own memory patterns from their review history. Most prior work treats word difficulty as a fixed, population-level property, while this approach makes it personalized.
The most immediate benefit is that your ten-minute contribution will help build a dataset that will become a permanent public resource for the entire language-learning research community.
Longer term, this same research makes a new generation of learning tools possible:
And because the dataset will be public, anyone will be able to build them, not just one company.
To make the research outcomes meaningful, the dataset must meet specific requirements.
A learner is welcome to participate if:
This survey collects:
You're submitting an .apkg file - a ZIP archive containing an SQLite database and associated media files. The database (named collection.anki21b, collection.anki21, or collection.anki2 depending on the Anki version) holds all card data, review history, scheduling parameters, and configuration. This section describes every table and field exposed by that database to prepare you for the section about data-handling.
Each file exported from Anki contains the following data.
| File | Description |
|---|---|
| collection.anki21b | Main SQLite database (modern format) |
| meta | Protobuf-encoded format version metadata |
| media | JSON mapping of numbered filenames to original media filenames |
| 0, 1, 2, … | Actual media files (images, audio, etc.) |
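Because an .apkg file is an ordinary ZIP archive, its layout can be checked with a few lines of Python. The following is a minimal sketch, not the survey's own tooling: the helper name `describe_apkg` is hypothetical, and the demo archive is built in memory with empty placeholder members rather than real Anki data.

```python
import io
import zipfile

# Database member names listed in this document, newest format first.
DB_NAMES = ("collection.anki21b", "collection.anki21", "collection.anki2")


def describe_apkg(source):
    """Return (database member names found, number of numbered media files)."""
    with zipfile.ZipFile(source) as archive:
        names = archive.namelist()
    databases = [name for name in DB_NAMES if name in names]
    # Media files are stored under purely numeric names: 0, 1, 2, ...
    media_files = [name for name in names if name.isdigit()]
    return databases, len(media_files)


# Self-contained demo: a stand-in archive with placeholder contents.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as archive:
    archive.writestr("collection.anki21b", b"")
    archive.writestr("meta", b"")
    archive.writestr("media", b"{}")
    archive.writestr("0", b"placeholder media bytes")

demo.seek(0)
print(describe_apkg(demo))
```

To look at your own export, pass its path instead of the in-memory buffer, for example `describe_apkg("my_deck.apkg")`.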
Below is an overview of what is stored in the collection.anki21b SQLite database.
col

A single-row table holding collection-level configuration. All deck presets, note types, and global settings are stored here as JSON blobs.

| Field | Type | Description |
|---|---|---|
| id | integer | Always 1 - only one row exists |
| crt | integer | Collection creation timestamp (Unix seconds) |
| mod | integer | Last modification timestamp (Unix milliseconds) |
| scm | integer | Schema modification timestamp |
| ver | integer | Schema version number |
| usn | integer | Update sequence number, used for AnkiWeb sync |
| ls | integer | Last sync timestamp |
| conf | JSON text | Global configuration: scheduler version, timezone offset, sort field, sort order, new card insertion order, etc. |
| models | JSON text | All note types (models): field definitions, card templates, CSS styling |
| decks | JSON text | All deck definitions: names, parent/child hierarchy, per-deck limits |
| dconf | JSON text | Deck configuration presets - contains FSRS weights (fsrsWeights, desiredRetention, ignoreRevlogsBeforeDate) and legacy SM-2 ease settings |
| tags | JSON text | Registry of all tags used across the collection |
dconf sub-fields (FSRS-relevant)

| Sub-field | Description |
|---|---|
| fsrsWeights | Array of 17-19 floats representing the optimised FSRS model parameters |
| desiredRetention | Target retention rate, e.g. 0.9 for 90% |
| ignoreRevlogsBeforeDate | ISO date string - review logs before this date are excluded from FSRS optimisation |
| newPerDay | Maximum new cards per day |
| maxIvl | Maximum interval in days |
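As an illustration of how the FSRS-relevant sub-fields could be read, here is a small sketch using Python's standard sqlite3 and json modules. It assumes the layout described above, with dconf holding a JSON object keyed by preset ID; the function name `fsrs_settings` is hypothetical, and the sample preset (including its shortened weights array) is invented for the demo.

```python
import json
import sqlite3


def fsrs_settings(conn):
    """Extract only the FSRS-relevant sub-fields from every deck preset."""
    (raw,) = conn.execute("SELECT dconf FROM col").fetchone()
    presets = json.loads(raw)
    return {
        preset_id: {
            "fsrsWeights": preset.get("fsrsWeights", []),
            "desiredRetention": preset.get("desiredRetention"),
        }
        for preset_id, preset in presets.items()
    }


# Demo on an in-memory database; the values are made up for illustration
# and the weights array is shortened (real ones are much longer).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE col (id INTEGER PRIMARY KEY, dconf TEXT)")
sample = {
    "1": {
        "name": "Default",
        "fsrsWeights": [0.4, 1.2, 3.1],
        "desiredRetention": 0.9,
        "maxIvl": 36500,
    }
}
conn.execute("INSERT INTO col VALUES (1, ?)", (json.dumps(sample),))

settings = fsrs_settings(conn)
print(settings)
```

Sub-fields such as the preset name or maxIvl are deliberately left out of the returned dictionary.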
notes

One row per note. A note is the source of content; one note can generate multiple cards via its template.

| Field | Type | Description |
|---|---|---|
| id | integer | Note creation timestamp (Unix milliseconds); serves as primary key |
| guid | text | Globally unique 10-character identifier, used to prevent duplicates on import |
| mid | integer | Foreign key → note type ID in col.models |
| mod | integer | Last modification timestamp (Unix seconds) |
| usn | integer | Update sequence number for sync |
| tags | text | Space-separated list of tags (e.g. biology anatomy) |
| flds | text | All field values concatenated, separated by the \x1f (ASCII unit separator) character - field order matches the model definition in col.models |
| sfld | text | Value of the sort field (used for browser sorting) |
| csum | integer | Numeric checksum of the first field, used for duplicate detection |
| flags | integer | Reserved flags |
| data | text | Extra data blob, used by add-ons |
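The flds and tags encodings are easy to unpack. The sketch below splits a flds value on the \x1f separator and pairs each value with its field name; the helper `split_note` and the sample field names are hypothetical, chosen only for illustration.

```python
FIELD_SEPARATOR = "\x1f"  # ASCII unit separator used inside notes.flds


def split_note(flds, tags, field_names):
    """Pair each field value with its name and split the tag string."""
    values = flds.split(FIELD_SEPARATOR)
    fields = dict(zip(field_names, values))
    # str.split() with no argument also drops leading/trailing spaces.
    return fields, tags.split()


# Hypothetical note type with three fields; the last field is empty.
fields, tags = split_note(
    "Hund\x1fdog\x1f",
    " noun animal ",
    ["Front", "Back", "Example"],
)
print(fields)
print(tags)
```

The field names come from the note type definition in col.models, matched by position.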
cards

One row per card. Multiple cards can share the same parent note (e.g. forward and reverse cards).

| Field | Type | Description |
|---|---|---|
| id | integer | Card creation timestamp (Unix milliseconds); primary key |
| nid | integer | Foreign key → parent note id |
| did | integer | Foreign key → deck id in col.decks |
| ord | integer | Template index (0-based) within the note type - identifies which card template generated this card |
| mod | integer | Last modification timestamp (Unix seconds) |
| usn | integer | Update sequence number for sync |
| type | integer | Card state: 0 = new, 1 = learning, 2 = review, 3 = relearning |
| queue | integer | Scheduling queue: -3 = user buried, -2 = sched buried, -1 = suspended, 0 = new, 1 = learning, 2 = review, 3 = day-learn, 4 = preview |
| due | integer | Due date; meaning varies: for new cards = position, for review = days since collection creation, for learning = Unix timestamp |
| ivl | integer | Current interval in days (negative value = seconds, used during learning steps) |
| factor | integer | Ease factor × 1000 (e.g. 2500 = 250% ease); used in legacy SM-2 scheduler |
| reps | integer | Total number of reviews |
| lapses | integer | Number of times the card was answered "Again" after graduation |
| left | integer | Remaining learning steps (encoded as a × 1000 + b: a = reps left today, b = steps left until graduation) |
| odue | integer | Original due date, used when card is in a filtered deck |
| odid | integer | Original deck ID, used when card is in a filtered deck |
| flags | integer | User-set colour flag (0–7) |
| data | text | JSON blob for extra scheduler data - FSRS stores stability s and difficulty d here, e.g. {"s":15.3,"d":6.8} |
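Two of the card fields are packed values. The sketch below decodes them, assuming Anki's usual encoding of left as a × 1000 + b (a = reps left today, b = steps left until graduation) and the {"s": ..., "d": ...} shape of cards.data shown above. Both function names are hypothetical.

```python
import json


def decode_left(left):
    """Split cards.left into (reps left today, steps left until graduation)."""
    reps_today, steps_to_graduation = divmod(left, 1000)
    return reps_today, steps_to_graduation


def fsrs_memory_state(data):
    """Return (stability, difficulty) from cards.data, or None if absent."""
    if not data:
        return None
    blob = json.loads(data)
    if "s" not in blob or "d" not in blob:
        return None
    return blob["s"], blob["d"]


print(decode_left(2004))
print(fsrs_memory_state('{"s":15.3,"d":6.8}'))
print(fsrs_memory_state(""))
```

Cards that were never scheduled with FSRS have no s and d values, which is why the second helper can return None.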
revlog

One row per individual review event. This is the primary source of data for FSRS optimisation.

| Field | Type | Description |
|---|---|---|
| id | integer | Review timestamp (Unix milliseconds); primary key |
| cid | integer | Foreign key → card id |
| usn | integer | Update sequence number for sync |
| ease | integer | Answer button pressed: 1 = Again, 2 = Hard, 3 = Good, 4 = Easy |
| ivl | integer | Interval after this review (days; negative = seconds) |
| lastIvl | integer | Interval before this review (days; negative = seconds) |
| factor | integer | Ease factor after this review × 1000 |
| time | integer | Time taken to answer (milliseconds, capped at 60,000) |
| type | integer | Review type: 0 = learning, 1 = review, 2 = relearn, 3 = filtered, 4 = manual |
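The sign convention on ivl and lastIvl means intervals need normalising before they can be compared. The following sketch converts both to seconds and derives a simple recalled/forgotten label from ease. Treating every answer other than "Again" as a successful recall is an assumption made for this example, and the names `interval_seconds` and `summarise` are hypothetical.

```python
SECONDS_PER_DAY = 86_400


def interval_seconds(ivl):
    """Positive values are days; negative values are already seconds."""
    return -ivl if ivl < 0 else ivl * SECONDS_PER_DAY


def summarise(row):
    """row = (ease, ivl, lastIvl, time, type) as selected from revlog."""
    ease, ivl, last_ivl, time_ms, review_type = row
    return {
        "recalled": ease > 1,  # assumption: only "Again" counts as a lapse
        "interval_before_s": interval_seconds(last_ivl),
        "interval_after_s": interval_seconds(ivl),
        "answer_seconds": time_ms / 1000,
        "is_manual": review_type == 4,
    }


# A learning-step review answered "Good": 10 minutes before, 3 days after.
example = summarise((3, 3, -600, 4200, 0))
print(example)
```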
graves

Tracks deleted objects (cards, notes, decks) for sync reconciliation. This table is not collected - see the exclusions section below.

| Field | Type | Description |
|---|---|---|
| usn | integer | Update sequence number at time of deletion |
| oid | integer | Original ID of the deleted object |
| type | integer | Object type: 0 = card, 1 = note, 2 = deck |
This section specifies which fields are collected, excluded, or transformed before inclusion in the research dataset for the purposes stated in the "What is collected" section.
Fields fall into three buckets: collected as-is, collected with transformation, or excluded entirely. Reasoning is grouped by category rather than enumerated per field, since most exclusions share a single rationale.
The following fields carry no research signal and/or can act as a quasi-identifier (sync timestamps cluster by user device), ultimately making it possible to track patterns unrelated to language learning. Therefore, these fields are removed before your submission and this data is never collected.
Sync metadata - only meaningful for AnkiWeb reconciliation.

- col.usn, col.ls, col.mod, col.dconf.ignoreRevlogsBeforeDate, col.dconf.newPerDay, col.dconf.maxIvl
- notes.usn, notes.mod
- cards.usn, cards.mod, cards.left, cards.odue, cards.odid
- revlog.usn
- graves table

Internal deduplication identifiers - used by Anki's import/sync logic.

- notes.guid (10-character globally unique identifier)
- notes.csum (checksum of the first field)

User personal annotations - color flags are user-private categorical signals (e.g. "card I'm anxious about"), and the add-on data blob has unknown structure and may contain anything an installed add-on chose to write.

- cards.flags, notes.flags (user-set colour flags)
- notes.data (arbitrary add-on data blob)

Redundant data

- col.tags (global tag registry - already captured per-note via notes.tags after filtering). See below on further tag processing.

Note-type styling and templates - CSS and card templates can encode user customisation patterns and occasionally contain identifying comments.

- col.models is reduced to field definitions only (field names, ordering); CSS, card templates, and template names are dropped before submission.

Timestamps - with per-user random offset. Every timestamp in a single user's submission is offset by the same random constant. This mechanism preserves all inter-event intervals - the only temporal signal SRS uses - while destroying alignment with real wall-clock time.

- revlog.id, col.crt, notes.id, cards.id

Applied post submission, but before the final dataset is released.
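The interval-preserving property of the timestamp offset can be checked directly. This is a minimal sketch of the idea only, not the project's actual processing code: the offset range, the seeded generator, and the function names are all assumptions made for the demo.

```python
import random

MS_PER_DAY = 86_400_000


def draw_offset_ms(rng, max_days=3650):
    """Draw one non-zero backward shift per user (the range is an assumption)."""
    return -rng.randrange(1, max_days * MS_PER_DAY)


def shift_timestamps(timestamps_ms, offset_ms):
    """Apply the same constant offset to every timestamp of one user."""
    return [t + offset_ms for t in timestamps_ms]


def gaps(timestamps_ms):
    """Differences between consecutive events."""
    return [b - a for a, b in zip(timestamps_ms, timestamps_ms[1:])]


rng = random.Random(42)  # seeded only to make the demo repeatable
offset = draw_offset_ms(rng)

# Three made-up review times: 10 minutes apart, then roughly a day later.
reviews = [1_700_000_000_000, 1_700_000_600_000, 1_700_086_400_000]
shifted = shift_timestamps(reviews, offset)

print(gaps(reviews) == gaps(shifted))
```

Since col.crt is stored in seconds while the other listed fields are in milliseconds, the same offset would need to be converted to seconds before being applied to it.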
Identifiers - synthetic per-user counters. Real Anki IDs (note, card, deck, model, review) are replaced with synthetic counters during processing.

- notes.id, cards.id, cards.nid, cards.did, revlog.cid, notes.mid

Applied post submission, but before the final dataset is released.
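Replacing IDs with counters only works if foreign keys are remapped through the same table as the primary keys they point to. A sketch of that idea follows, with a hypothetical `IdRemapper` class and invented sample rows; the real processing pipeline may differ.

```python
from itertools import count


class IdRemapper:
    """Hand out 1, 2, 3, ... for real IDs in order of first appearance."""

    def __init__(self):
        self._next = count(1)
        self._mapping = {}

    def __call__(self, real_id):
        if real_id not in self._mapping:
            self._mapping[real_id] = next(self._next)
        return self._mapping[real_id]


# One remapper per ID space, so note and card counters stay independent.
note_ids = IdRemapper()
card_ids = IdRemapper()

notes = [{"id": 1690000000123}, {"id": 1690000000456}]
cards = [
    {"id": 1690000001000, "nid": 1690000000456},
    {"id": 1690000001001, "nid": 1690000000123},
]

new_notes = [{"id": note_ids(n["id"])} for n in notes]
# cards.nid goes through the note remapper so the link to notes survives.
new_cards = [{"id": card_ids(c["id"]), "nid": note_ids(c["nid"])} for c in cards]

print(new_notes)
print(new_cards)
```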
Tags. Tags may be either linguistically useful (noun, irregular_verb, feminine) or idiosyncratic (for_friday_test, names). The survey provides a way to remove tags that you don't want to include in the final dataset.

- notes.tags

Deck names - replaced with neutral identifiers. User-set deck names (which can leak topic, purpose, or personal context - e.g. "German for Anna's wedding") are replaced with neutral identifiers. Hierarchy is preserved; semantic content is not.

- col.decks → name

Card content - passes through your review before submission. Before any data leaves your machine, you see a note-by-note preview of notes.flds content with an include/exclude toggle for each note. In Anki, one note can generate one or more cards, so excluding a note removes every card generated from that note from the submission entirely, along with their review history and scheduling state. This is in-the-loop consent, not post-hoc redaction. Your excluded note data is NOT included in the submission in any form.
You can also remove selected fields for a whole note type. In that case, the submitted payload keeps the field name and field position so the research can account for what kind of cue was present, but the field value is replaced with null for every included note using that note type. Field removal does not remove the note, cards, or review logs.
- notes.flds, notes.sfld
- cards.type, cards.queue, cards.due, cards.ivl, cards.factor, cards.reps, cards.lapses, cards.ord
- cards.data (stability s, difficulty d)
- revlog.ease, revlog.ivl, revlog.lastIvl, revlog.factor, revlog.time, revlog.type
- col.dconf → fsrsWeights, desiredRetention
- col.models → field definitions only (names, ordering)
- col.conf → scheduler version
- col.ver

You can inspect the resulting submission file by downloading the .json before clicking the "Submit" button.
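To make the field-removal rule described above concrete, the sketch below shows how a removed field could keep its name and position while its value becomes null. The payload shape and the function name `redact_fields` are hypothetical; check the downloadable .json for the real structure.

```python
import json


def redact_fields(field_names, values, removed):
    """Keep every field's name and position; null out the removed ones."""
    return [
        {
            "name": name,
            "position": position,
            "value": None if name in removed else value,
        }
        for position, (name, value) in enumerate(zip(field_names, values))
    ]


# Invented note with a personal "Notes" field the participant removes.
payload = redact_fields(
    ["Front", "Back", "Notes"],
    ["Hund", "dog", "from Anna's list"],
    removed={"Notes"},
)
print(json.dumps(payload, ensure_ascii=False))
```

Python's None serialises to JSON null, so the removed field stays visible as a slot without exposing its content.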
The pre-submission UI runs a local pattern scanner over card content and flags potential PII for attention:
Flagged notes surface in the review UI with the trigger highlighted. You can decide whether to include or exclude each note.
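A local pattern scanner of this kind can be approximated with regular expressions. The patterns below (email, URL, phone-like digit runs) are illustrative assumptions only and are not the survey's actual rule set; the function name `scan` is likewise hypothetical.

```python
import re

# Assumed example patterns; the real scanner may use different ones.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "url": re.compile(r"https?://\S+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def scan(text):
    """Return (label, matched text, start offset) for every pattern hit."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((label, match.group(), match.start()))
    return hits


print(scan("der Hund\x1fthe dog"))
print(scan("write to anna@example.com"))
```

The start offset is what would let a review UI highlight the trigger inside the note text.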
Card content is additionally scanned locally for terms in health, religious, and political categories. When matches are found above a small threshold, you're shown an additional explicit notice before submission.
Withdrawal requests immediately mark linked submissions as withdrawn and exclude them from research processing and public release. Withdrawn payloads are permanently deleted from live storage during scheduled maintenance, normally within 7 days. Withdrawal is not supported after the public release of the dataset, which is scheduled for October 2026.
To further reduce any residual risk of re-identification, the public dataset will include submissions for a given target language only if at least three participants for that language are represented in the collected data. This threshold is intended to help ensure that the anonymization measures described above remain sufficient, even in cases where a learner has publicly shared parts of their learning history elsewhere.
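The three-participant rule amounts to a simple filter over submissions. A minimal sketch follows, assuming each submission record carries a participant identifier and a target language; both field names and the function `releasable` are hypothetical.

```python
from collections import defaultdict

MIN_PARTICIPANTS = 3  # threshold stated above


def releasable(submissions):
    """Keep only submissions whose language has enough distinct participants."""
    participants = defaultdict(set)
    for item in submissions:
        participants[item["language"]].add(item["participant"])
    allowed = {
        language
        for language, people in participants.items()
        if len(people) >= MIN_PARTICIPANTS
    }
    return [item for item in submissions if item["language"] in allowed]


# Invented data: three German learners, two Irish learners.
submissions = [
    {"participant": 1, "language": "de"},
    {"participant": 2, "language": "de"},
    {"participant": 3, "language": "de"},
    {"participant": 4, "language": "ga"},
    {"participant": 5, "language": "ga"},
]
released = releasable(submissions)
print([item["participant"] for item in released])
```

Counting distinct participants rather than submissions keeps a single learner with several uploads from satisfying the threshold alone.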