How we handle your data

Last updated 12 May 2026

How your data is structured
What is collected
How your data is handled

How your data is structured

An .apkg file is a ZIP archive containing an SQLite database and associated media files. The database (named collection.anki21b, collection.anki21, or collection.anki2 depending on the Anki version) holds all card data, review history, scheduling parameters, and configuration. This document describes every table and field exposed by that database, and specifies which fields are collected, excluded, or transformed before inclusion in the research dataset.

The dataset is intended for public release after the research is completed. The collection decisions and mitigations below are designed against that bar.

File Structure

Each .apkg file exported from Anki contains the following entries.

File	Description
`collection.anki21b`	Main SQLite database (modern format)
`meta`	Protobuf-encoded format version metadata
`media`	JSON mapping of numbered filenames to original media filenames
`0`, `1`, `2`, …	Actual media files (images, audio, etc.)
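As a rough illustration of the layout above, the sketch below groups the members of an archive by role using only Python's standard `zipfile` module. The function name `classify_apkg_entries` and the stand-in archive are illustrative assumptions, not part of our tooling, and the sketch only lists member names without opening the database itself.

```python
import io
import zipfile

# Database names mentioned in this document, newest format first.
DB_NAMES = ("collection.anki21b", "collection.anki21", "collection.anki2")

def classify_apkg_entries(apkg_bytes: bytes) -> dict:
    """Group the members of an .apkg archive by role, without extracting them."""
    with zipfile.ZipFile(io.BytesIO(apkg_bytes)) as archive:
        names = archive.namelist()
    return {
        "database": [n for n in names if n in DB_NAMES],
        "metadata": [n for n in names if n in ("meta", "media")],
        "media_files": [n for n in names if n.isdigit()],
    }

# Build a tiny stand-in archive (placeholder bytes, not a real collection).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    for name in ("collection.anki21b", "meta", "media", "0", "1"):
        archive.writestr(name, b"placeholder")

entries = classify_apkg_entries(buffer.getvalue())
print(entries)
```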

Below is an overview of what is stored in the collection.anki21b SQLite database.

Table: `col`

A single-row table holding collection-level configuration. All deck presets, note types, and global settings are stored here as JSON blobs.

Field	Type	Description
`id`	integer	Always `1` - only one row exists
`crt`	integer	Collection creation timestamp (Unix seconds)
`mod`	integer	Last modification timestamp (Unix milliseconds)
`scm`	integer	Schema modification timestamp
`ver`	integer	Schema version number
`usn`	integer	Update sequence number, used for AnkiWeb sync
`ls`	integer	Last sync timestamp
`conf`	JSON text	Global configuration: scheduler version, timezone offset, sort field, sort order, new card insertion order, etc.
`models`	JSON text	All note types (models): field definitions, card templates, CSS styling
`decks`	JSON text	All deck definitions: names, parent/child hierarchy, per-deck limits
`dconf`	JSON text	Deck configuration presets - contains FSRS settings (`fsrsWeights`, `desiredRetention`, `ignoreRevlogsBeforeDate`) and legacy SM-2 ease settings
`tags`	JSON text	Registry of all tags used across the collection

Key `dconf` sub-fields (FSRS-relevant)

Sub-field	Description
`fsrsWeights`	Array of 17-19 floats representing the optimised FSRS model parameters
`desiredRetention`	Target retention rate, e.g. `0.9` for 90%
`ignoreRevlogsBeforeDate`	ISO date string - review logs before this date are excluded from FSRS optimisation
`newPerDay`	Maximum new cards per day
`maxIvl`	Maximum interval in days
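A minimal sketch of pulling only these sub-fields out of the `dconf` JSON text. It assumes, for illustration only, that `dconf` is an object keyed by preset ID and that each sub-field is a top-level key of its preset; the real layout may nest some of them differently. `extract_fsrs_settings` is a hypothetical helper name.

```python
import json

# Sub-fields listed in the table above.
KEPT_KEYS = ("fsrsWeights", "desiredRetention", "ignoreRevlogsBeforeDate",
             "newPerDay", "maxIvl")

def extract_fsrs_settings(dconf_json: str) -> dict:
    """Return, per preset, only the keys named in KEPT_KEYS (if present)."""
    presets = json.loads(dconf_json)
    return {
        preset_id: {key: preset[key] for key in KEPT_KEYS if key in preset}
        for preset_id, preset in presets.items()
    }

# Made-up sample preset; the "name" key is dropped because it is not whitelisted.
sample = json.dumps({
    "1": {"name": "Default", "desiredRetention": 0.9,
          "newPerDay": 20, "maxIvl": 36500}
})
settings = extract_fsrs_settings(sample)
print(settings)
```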

Table: `notes`

One row per note. A note is the source of content; one note can generate multiple cards via its template.

Field	Type	Description
`id`	integer	Note creation timestamp (Unix milliseconds); serves as primary key
`guid`	text	Globally unique 10-character identifier, used to prevent duplicates on import
`mid`	integer	Foreign key → note type ID in `col.models`
`mod`	integer	Last modification timestamp (Unix seconds)
`usn`	integer	Update sequence number for sync
`tags`	text	Space-separated list of tags (e.g. `biology anatomy`)
`flds`	text	All field values concatenated, separated by the `\x1f` (ASCII unit separator) character - field order matches the model definition in `col.models`
`sfld`	text	Value of the sort field (used for browser sorting)
`csum`	integer	Numeric checksum of the first field, used for duplicate detection
`flags`	integer	Reserved flags
`data`	text	Extra data blob, used by add-ons
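To make the `flds` and `tags` encodings concrete, here is a small sketch that splits a `flds` value on the `\x1f` separator and pairs the values with field names. The field names "Front" and "Back" are made-up examples; in practice they would come from the note type in `col.models`.

```python
FIELD_SEPARATOR = "\x1f"  # ASCII unit separator used inside notes.flds

def split_fields(flds: str, field_names: list[str]) -> dict[str, str]:
    """Pair each field value with its name, in model order."""
    values = flds.split(FIELD_SEPARATOR)
    return dict(zip(field_names, values))

def parse_tags(tags: str) -> list[str]:
    """notes.tags is space-separated; split() also drops padding spaces."""
    return tags.split()

note = split_fields("Hund\x1fdog", ["Front", "Back"])
tags = parse_tags(" biology anatomy ")
print(note, tags)
```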

Table: `cards`

One row per card. Multiple cards can share the same parent note (e.g. forward and reverse cards).

Field	Type	Description
`id`	integer	Card creation timestamp (Unix milliseconds); primary key
`nid`	integer	Foreign key → parent note `id`
`did`	integer	Foreign key → deck `id` in `col.decks`
`ord`	integer	Template index (0-based) within the note type - identifies which card template generated this card
`mod`	integer	Last modification timestamp (Unix seconds)
`usn`	integer	Update sequence number for sync
`type`	integer	Card state: `0` = new, `1` = learning, `2` = review, `3` = relearning
`queue`	integer	Scheduling queue: `-3` = user buried, `-2` = sched buried, `-1` = suspended, `0` = new, `1` = learning, `2` = review, `3` = day-learn, `4` = preview
`due`	integer	Due date; meaning varies: for new cards = position, for review = days since collection creation, for learning = Unix timestamp
`ivl`	integer	Current interval in days (negative value = seconds, used during learning steps)
`factor`	integer	Ease factor × 1000 (e.g. `2500` = 250% ease); used in legacy SM-2 scheduler
`reps`	integer	Total number of reviews
`lapses`	integer	Number of times the card was answered "Again" after graduation
`left`	integer	Remaining learning steps (encoded as `a * 1000 + b`: `a` = reps left today, `b` = `left % 1000` = reps left until graduation)
`odue`	integer	Original due date, used when card is in a filtered deck
`odid`	integer	Original deck ID, used when card is in a filtered deck
`flags`	integer	User-set colour flag (0–7)
`data`	text	JSON blob for extra scheduler data - FSRS stores stability `s` and difficulty `d` here, e.g. `{"s":15.3,"d":6.8}`
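Two of these encodings are easy to misread, so here is a hedged sketch of decoding them: the sign convention of `ivl` and the FSRS memory state in `data`. The helper names are illustrative, and the sketch assumes `data` is either empty or valid JSON.

```python
import json

SECONDS_PER_DAY = 86_400

def interval_seconds(ivl: int) -> int:
    """Normalise cards.ivl to seconds: negative = seconds, positive = days."""
    return -ivl if ivl < 0 else ivl * SECONDS_PER_DAY

def memory_state(data: str):
    """Return (stability, difficulty) from cards.data, or (None, None)."""
    blob = json.loads(data) if data else {}
    return blob.get("s"), blob.get("d")

print(interval_seconds(-600))                   # learning step stored in seconds
print(interval_seconds(3))                      # review interval stored in days
print(memory_state('{"s":15.3,"d":6.8}'))
```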

Table: `revlog`

One row per individual review event. This is the primary source of data for FSRS optimisation.

Field	Type	Description
`id`	integer	Review timestamp (Unix milliseconds); primary key
`cid`	integer	Foreign key → card `id`
`usn`	integer	Update sequence number for sync
`ease`	integer	Answer button pressed: `1` = Again, `2` = Hard, `3` = Good, `4` = Easy
`ivl`	integer	Interval after this review (days; negative = seconds)
`lastIvl`	integer	Interval before this review (days; negative = seconds)
`factor`	integer	Ease factor after this review × 1000
`time`	integer	Time taken to answer (milliseconds, capped at 60,000)
`type`	integer	Review type: `0` = learning, `1` = review, `2` = relearn, `3` = filtered, `4` = manual
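The sketch below decodes one `revlog` row into readable values using the codes from the table. The row is a made-up example and `decode_review` is a hypothetical helper; unknown `ease` or `type` codes are passed through as "unknown" rather than raising an error.

```python
from datetime import datetime, timezone

EASE_LABELS = {1: "Again", 2: "Hard", 3: "Good", 4: "Easy"}
REVIEW_TYPES = {0: "learning", 1: "review", 2: "relearn", 3: "filtered", 4: "manual"}

def decode_review(row: dict) -> dict:
    """Translate raw revlog integers into labelled values."""
    return {
        # revlog.id is the review time in Unix milliseconds.
        "reviewed_at": datetime.fromtimestamp(row["id"] / 1000, tz=timezone.utc),
        "answer": EASE_LABELS.get(row["ease"], "unknown"),
        "kind": REVIEW_TYPES.get(row["type"], "unknown"),
        "seconds_taken": row["time"] / 1000,
    }

decoded = decode_review({"id": 1_700_000_000_000, "ease": 3, "type": 1, "time": 4200})
print(decoded)
```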

Table: `graves`

Tracks deleted objects (cards, notes, decks) for sync reconciliation. This table is not collected - see the exclusions section below.

Field	Type	Description
`usn`	integer	Update sequence number at time of deletion
`oid`	integer	Original ID of the deleted object
`type`	integer	Object type: `0` = card, `1` = note, `2` = deck

What is collected

We're collecting:

Card contents - to estimate population-level complexity and find lexical neighbors;
Anki review logs - to find memory patterns across neighbors and train a model that predicts how many reviews you'd realistically need to master a given word (the prediction target may change as the research progresses);
Languages you're proficient in (C1+) - L1 and other known languages transfer knowledge into new ones;
Your interests or domain - this affects your prior exposure, which shifts your "similarity" to certain words (e.g., as an AI researcher I'm more familiar with technical terms than biological ones);
Consent to include your submitted data in the final public dataset. There is very little research in this area precisely because no large public datasets exist on how personalization changes lexical complexity. A public dataset reflecting how memory responds to different vocabulary types would push the field forward considerably.

How your data is handled

Fields fall into three buckets: collected as-is, collected with transformation, or excluded entirely. Reasoning is grouped by category rather than enumerated per field, since most exclusions share a single rationale.

Excluded entirely

The following fields carry no research signal, can act as quasi-identifiers (for example, sync timestamps cluster by user device), or both, which would make it possible to track patterns unrelated to language learning. We therefore remove these fields before submission and never collect them.

Sync metadata - only meaningful for AnkiWeb reconciliation.

col.usn, col.ls, col.mod
notes.usn, notes.mod
cards.usn, cards.mod
revlog.usn
The entire graves table

Internal deduplication identifiers - used by Anki's import/sync logic.

notes.guid (10-character globally unique identifier)
notes.csum (checksum of the first field)

User personal annotations - colour flags are user-private categorical signals (e.g. "card I'm anxious about"), and the add-on data blob has unknown structure and may contain anything an installed add-on chose to write.

cards.flags, notes.flags (user-set colour flags)
notes.data (arbitrary add-on data blob)

Redundant data

col.tags (global tag registry - already captured per-note via notes.tags after filtering). See below on further tag processing.

Note-type styling and templates - CSS and card templates can encode user customisation patterns and occasionally contain identifying comments.

col.models is reduced to field definitions only (field names, ordering); CSS, card templates, and template names are dropped before submission.
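A sketch of this reduction, under the assumption that each note type in `col.models` carries its field list under `flds` (with `name` and `ord`), its templates under `tmpls`, and its styling under `css`. Those key names and the `reduce_models` helper are assumptions for illustration, not a description of our actual pipeline.

```python
import json

def reduce_models(models_json: str) -> dict:
    """Keep only ordered field names per note type; drop everything else."""
    models = json.loads(models_json)
    reduced = {}
    for model_id, model in models.items():
        fields = sorted(model.get("flds", []), key=lambda f: f.get("ord", 0))
        reduced[model_id] = {"fields": [f["name"] for f in fields]}
    return reduced

# Made-up note type with styling and a template that must not survive.
sample = json.dumps({
    "1342697561419": {
        "name": "Basic",
        "css": ".card { color: red; } /* made by Anna */",
        "tmpls": [{"name": "Card 1", "qfmt": "{{Front}}", "afmt": "{{Back}}"}],
        "flds": [{"name": "Back", "ord": 1}, {"name": "Front", "ord": 0}],
    }
})
reduced = reduce_models(sample)
print(reduced)
```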

Collected with transformation

Timestamps - with per-user random offset. Every timestamp in a single user's submission is shifted by the same random constant drawn at submission time. This preserves all inter-event intervals - the only temporal signal SRS uses - while destroying alignment with real wall-clock time. It prevents inference of sleep schedule, work hours, timezone, and weekday patterns from review timing.

Applies to: revlog.id, col.crt, notes.id, cards.id
The offset is not stored.

Applied after submission, but before the final dataset is released.
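A minimal sketch of the offset idea, assuming the offset is drawn in whole seconds so it can be applied exactly to both second-resolution (`col.crt`) and millisecond-resolution (`revlog.id`, `notes.id`, `cards.id`) values. The bound `MAX_OFFSET_SECONDS` and the function names are hypothetical; the real range and sampling method may differ.

```python
import secrets

MAX_OFFSET_SECONDS = 20 * 365 * 86_400  # hypothetical bound, not from this policy

def draw_offset_seconds() -> int:
    """Draw one offset per submission, uniform in [-MAX, +MAX] whole seconds."""
    return secrets.randbelow(2 * MAX_OFFSET_SECONDS + 1) - MAX_OFFSET_SECONDS

def shift_submission(col_crt_s: int, ids_ms: list[int], offset_s: int):
    """Shift a seconds timestamp and millisecond timestamps by the same offset."""
    return col_crt_s + offset_s, [ts + offset_s * 1000 for ts in ids_ms]

offset = draw_offset_seconds()  # used once, then discarded (never stored)
review_ids = [1_700_000_000_000, 1_700_000_060_000, 1_700_086_400_000]
new_crt, shifted = shift_submission(1_600_000_000, review_ids, offset)

# Gaps between consecutive reviews are identical before and after the shift.
original_gaps = [b - a for a, b in zip(review_ids, review_ids[1:])]
shifted_gaps = [b - a for a, b in zip(shifted, shifted[1:])]
print(original_gaps == shifted_gaps)
```

Because the same constant is added everywhere, the differences between any two timestamps are unchanged, which is the property the scheduler data depends on.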

Identifiers - synthetic per-user counters. Real Anki IDs (note, card, deck, model, review) are replaced with synthetic counters during processing. This means that a future Anki export from the same user cannot be cross-referenced against published submissions.

Applies to: notes.id, cards.id, cards.nid, cards.did, revlog.cid, notes.mid
The mapping from real to synthetic IDs is not retained.

Applied after submission, but before the final dataset is released.
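One way to picture the ID replacement is the sketch below, which assigns counters in first-seen order and rewrites foreign keys through the same map. The row dictionaries and `make_counter_map` are illustrative assumptions; the actual processing may assign counters differently (for example, in shuffled order).

```python
def make_counter_map(real_ids) -> dict[int, int]:
    """Map each distinct real ID to 1, 2, 3, ... in first-seen order."""
    return {real: n for n, real in enumerate(dict.fromkeys(real_ids), start=1)}

# Made-up rows: two notes, three cards referring to them.
notes = [{"id": 1700000000001}, {"id": 1700000000500}]
cards = [
    {"id": 1700000001000, "nid": 1700000000500},
    {"id": 1700000001001, "nid": 1700000000001},
    {"id": 1700000001002, "nid": 1700000000001},
]

note_map = make_counter_map(n["id"] for n in notes)
card_map = make_counter_map(c["id"] for c in cards)

# Foreign keys go through the same map, so relationships survive the rewrite.
synthetic_cards = [{"id": card_map[c["id"]], "nid": note_map[c["nid"]]} for c in cards]
print(synthetic_cards)

# The maps are dropped afterwards, so the rewrite cannot be reversed.
del note_map, card_map
```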

Tags. Tags may be either linguistically useful (noun, irregular_verb, feminine) or idiosyncratic (for_friday_test, names). You can remove any tags that you do not want included in the final dataset.

Applies to: notes.tags

Deck names - replaced with neutral identifiers. User-set deck names (which can leak topic, purpose, or personal context - e.g. "German for Anna's wedding") are replaced with neutral identifiers. Hierarchy is preserved; semantic content is not.

Applies to: col.decks → name
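The sketch below shows one way hierarchy can be preserved while names are discarded. It assumes Anki's `::` separator between parent and child deck names; the `deck_N` identifier scheme and the `neutralise_deck_names` helper are hypothetical.

```python
def neutralise_deck_names(names: list[str]) -> dict[str, str]:
    """Replace every path component with a neutral ID, keeping the nesting."""
    component_ids: dict[tuple[str, ...], str] = {}
    result = {}
    for name in names:
        parts = name.split("::")
        neutral_parts = []
        for depth in range(1, len(parts) + 1):
            prefix = tuple(parts[:depth])  # identifies one node in the deck tree
            if prefix not in component_ids:
                component_ids[prefix] = f"deck_{len(component_ids) + 1}"
            neutral_parts.append(component_ids[prefix])
        result[name] = "::".join(neutral_parts)
    return result

renamed = neutralise_deck_names(["German", "German::Anna's wedding", "Spanish"])
print(renamed)
```

The child deck keeps its parent's identifier as a prefix, so the tree shape is recoverable even though the original wording is not.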

Card content - passes through your review before submission. Before any data leaves your machine, you see a card-by-card preview of notes.flds content with a per-card include/exclude toggle. Excluded cards are removed from the submission entirely along with their review history and scheduling state. This is in-the-loop consent, not post-hoc redaction. We never see what you exclude.

Applies to: notes.flds, notes.sfld

Collected as-is

Scheduling state: cards.type, cards.queue, cards.due, cards.ivl, cards.factor, cards.reps, cards.lapses, cards.left, cards.ord, cards.odue, cards.odid
FSRS per-card memory state: cards.data (stability s, difficulty d)
Review events: revlog.ease, revlog.ivl, revlog.lastIvl, revlog.factor, revlog.time, revlog.type
FSRS configuration per deck: col.dconf → fsrsWeights, desiredRetention, ignoreRevlogsBeforeDate, newPerDay, maxIvl
Note-type structure: col.models → field definitions only (names, ordering)
Collection-level scheduler config: col.conf → scheduler version, timezone offset (rounded to nearest hour), sort field
Schema version: col.ver
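For illustration, a column whitelist like the one above can be applied with a plain projection query. The sketch uses an in-memory SQLite table with made-up values and untyped columns; it is not the real Anki schema definition, and `CARD_COLUMNS_KEPT` simply mirrors the `cards` fields listed in this section.

```python
import sqlite3

ALL_COLUMNS = ("id", "nid", "did", "ord", "mod", "usn", "type", "queue", "due",
               "ivl", "factor", "reps", "lapses", "left", "odue", "odid",
               "flags", "data")
CARD_COLUMNS_KEPT = ("type", "queue", "due", "ivl", "factor", "reps", "lapses",
                     "left", "ord", "odue", "odid", "data")

def quoted(columns) -> str:
    """Quote identifiers so names such as "left" are never read as keywords."""
    return ", ".join(f'"{c}"' for c in columns)

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute(f"CREATE TABLE cards ({quoted(ALL_COLUMNS)})")
conn.execute(
    f"INSERT INTO cards VALUES ({', '.join('?' * len(ALL_COLUMNS))})",
    (1, 10, 100, 0, 1700000000, -1, 2, 2, 350, 30, 2500, 12, 1, 0, 0, 0, 3,
     '{"s":15.3,"d":6.8}'),
)

# Only whitelisted columns are selected; usn, mod and flags never leave the table.
row = conn.execute(f"SELECT {quoted(CARD_COLUMNS_KEPT)} FROM cards").fetchone()
kept = dict(row)
print(kept)
```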

Additional strategies

1. Automated PII pattern scanner

The pre-submission UI runs a local pattern scanner over card content and flags potential PII for your attention:

Email addresses;
Phone numbers (multiple formats and country conventions);
Long digit sequences (potential ID numbers, card numbers, dates of birth);
URLs;
Common given-name and surname lists for the languages being studied.

Flagged cards surface in the review UI with the trigger highlighted. You decide whether to include or exclude each flagged card.
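To give a feel for what such a scanner does, here is a deliberately simple sketch using regular expressions. The patterns and the eight-digit threshold are illustrative assumptions and are much cruder than a production scanner; name-list matching is omitted.

```python
import re

# Illustrative patterns only; real-world formats vary far more than this.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "url": re.compile(r"https?://\S+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "long_digits": re.compile(r"\d{8,}"),
}

def scan_card(text: str) -> list[tuple[str, str]]:
    """Return (label, matched text) pairs so the UI can highlight each trigger."""
    hits = []
    for label, pattern in PII_PATTERNS.items():
        hits.extend((label, match.group()) for match in pattern.finditer(text))
    return hits

flagged = scan_card("Mail anna@example.com or see https://example.com/x")
phone_hits = scan_card("Call +49 170 1234567")
clean = scan_card("der Hund")
print(flagged, phone_hits, clean)
```

The scanner only flags; it never removes anything on its own, which keeps the include/exclude decision with you.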

2. Special-category content check

Card content is additionally scanned locally for terms in health, religious, and political categories. When matches are found above a small threshold, you're shown an additional explicit notice before submission, since under GDPR Article 9 these are special categories of personal data that require explicit consent.
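A hedged sketch of the threshold logic: count how many cards mention a term from each category, and report the categories whose count exceeds the threshold. The term lists and the threshold value below are placeholders invented for the example, not the lists we use.

```python
import re
from collections import Counter

# Placeholder term lists and threshold, for illustration only.
CATEGORY_TERMS = {
    "health": {"diagnosis", "medication"},
    "religion": {"church", "prayer"},
    "politics": {"election", "parliament"},
}
NOTICE_THRESHOLD = 1  # a notice is shown when more cards than this match

def categories_needing_notice(card_texts: list[str]) -> list[str]:
    """Return categories matched by more than NOTICE_THRESHOLD cards."""
    hits = Counter()
    for text in card_texts:
        words = set(re.findall(r"\w+", text.lower()))
        for category, terms in CATEGORY_TERMS.items():
            if words & terms:
                hits[category] += 1
    return sorted(c for c, n in hits.items() if n > NOTICE_THRESHOLD)

cards = ["The diagnosis was late", "Take your medication", "der Hund", "the election"]
notice_for = categories_needing_notice(cards)
print(notice_for)
```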

3. Withdrawal mechanism

Each submission generates a one-way submission token that the participant retains. Submitting the token to the withdrawal endpoint removes the contribution from the working dataset and, if the public release has not yet occurred, from the eventual release. Withdrawal requests are processed on a scheduled basis. Withdrawal after the public release, which is scheduled for October 2026, is not supported.
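One common way to build a one-way token is sketched below: the participant keeps a random token, and only its SHA-256 digest is stored next to the submission, so the stored value cannot be turned back into the token. This is an assumption about how such a scheme could work, not a description of our endpoint; `issue_token` and `token_matches` are hypothetical names.

```python
import hashlib
import hmac
import secrets

def issue_token() -> tuple[str, str]:
    """Return (token for the participant, digest to store with the submission)."""
    token = secrets.token_urlsafe(32)
    digest = hashlib.sha256(token.encode("utf-8")).hexdigest()
    return token, digest

def token_matches(token: str, stored_digest: str) -> bool:
    """Check a presented token against the stored digest in constant time."""
    candidate = hashlib.sha256(token.encode("utf-8")).hexdigest()
    return hmac.compare_digest(candidate, stored_digest)

token, stored = issue_token()
print(token_matches(token, stored))          # the right token is accepted
print(token_matches("wrong-token", stored))  # any other value is rejected
```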
