How we handle your data

Last updated 12 May 2026

How your data is structured

An .apkg file is a ZIP archive containing an SQLite database and associated media files. The database (named collection.anki21b, collection.anki21, or collection.anki2 depending on the Anki version) holds all card data, review history, scheduling parameters, and configuration. This document describes every table and field exposed by that database, and specifies which fields are collected, excluded, or transformed before inclusion in the research dataset.

The dataset is intended for public release after the research is completed. The collection decisions and mitigations below are designed against that bar.

File Structure

Each .apkg file exported from Anki contains the following entries.

File | Description
collection.anki21b | Main SQLite database (modern format)
meta | Protobuf-encoded format version metadata
media | JSON mapping of numbered filenames to original media filenames
0, 1, 2, … | Actual media files (images, audio, etc.)
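
As a rough illustration of this layout, the sketch below lists the entries of an archive and picks the database file to read. The helper names pick_collection_name and list_apkg_members are hypothetical, and the preference order (newest format first) is our assumption based on the file names mentioned in this document.

```python
import zipfile

# Database file names mentioned in this document, newest format first.
COLLECTION_NAMES = ("collection.anki21b", "collection.anki21", "collection.anki2")


def pick_collection_name(member_names):
    """Return the newest-format database name present, or None if none is."""
    present = set(member_names)
    for name in COLLECTION_NAMES:
        if name in present:
            return name
    return None


def list_apkg_members(path):
    """Return the entry names stored inside an .apkg (a ZIP archive)."""
    with zipfile.ZipFile(path) as archive:
        return archive.namelist()
```

A caller would pass the result of list_apkg_members to pick_collection_name before extracting anything.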

Below is an overview of what is stored in the collection.anki21b SQLite database.

Table: col

A single-row table holding collection-level configuration. All deck presets, note types, and global settings are stored here as JSON blobs.

Field | Type | Description
id | integer | Always 1 - only one row exists
crt | integer | Collection creation timestamp (Unix seconds)
mod | integer | Last modification timestamp (Unix milliseconds)
scm | integer | Schema modification timestamp
ver | integer | Schema version number
usn | integer | Update sequence number, used for AnkiWeb sync
ls | integer | Last sync timestamp
conf | JSON text | Global configuration: scheduler version, timezone offset, sort field, sort order, new card insertion order, etc.
models | JSON text | All note types (models): field definitions, card templates, CSS styling
decks | JSON text | All deck definitions: names, parent/child hierarchy, per-deck limits
dconf | JSON text | Deck configuration presets - contains FSRS weights (fsrsWeights, desiredRetention, ignoreRevlogsBeforeDate) and legacy SM-2 ease settings
tags | JSON text | Registry of all tags used across the collection

Key dconf sub-fields (FSRS-relevant)

Sub-field | Description
fsrsWeights | Array of 17-19 floats representing the optimised FSRS model parameters
desiredRetention | Target retention rate, e.g. 0.9 for 90%
ignoreRevlogsBeforeDate | ISO date string - review logs before this date are excluded from FSRS optimisation
newPerDay | Maximum new cards per day
maxIvl | Maximum interval in days
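
A minimal sketch of pulling these sub-fields out of the dconf JSON follows. It assumes dconf is a JSON object keyed by preset ID and that the sub-fields appear as top-level keys of each preset, as listed above; real exports may nest some of them differently. The function name extract_fsrs_config is hypothetical.

```python
import json

# Sub-field names taken from the table above.
FSRS_KEYS = (
    "fsrsWeights",
    "desiredRetention",
    "ignoreRevlogsBeforeDate",
    "newPerDay",
    "maxIvl",
)


def extract_fsrs_config(dconf_json):
    """Map each preset ID to the listed sub-fields that the preset contains."""
    presets = json.loads(dconf_json)
    return {
        preset_id: {key: preset[key] for key in FSRS_KEYS if key in preset}
        for preset_id, preset in presets.items()
    }
```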

Table: notes

One row per note. A note is the source of content; one note can generate multiple cards via its template.

Field | Type | Description
id | integer | Note creation timestamp (Unix milliseconds); serves as primary key
guid | text | Globally unique 10-character identifier, used to prevent duplicates on import
mid | integer | Foreign key → note type ID in col.models
mod | integer | Last modification timestamp (Unix seconds)
usn | integer | Update sequence number for sync
tags | text | Space-separated list of tags (e.g. biology anatomy)
flds | text | All field values concatenated, separated by the \x1f (ASCII unit separator) character - field order matches the model definition in col.models
sfld | text | Value of the sort field (used for browser sorting)
csum | integer | Numeric checksum of the first field, used for duplicate detection
flags | integer | Reserved flags
data | text | Extra data blob, used by add-ons
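
To make the flds and tags encodings concrete, here is a small sketch that splits both columns and pairs field values with field names. The helper names are hypothetical; the field names would come from the matching note type in col.models.

```python
FIELD_SEPARATOR = "\x1f"  # ASCII unit separator used in notes.flds


def split_fields(flds):
    """Split notes.flds into its individual field values."""
    return flds.split(FIELD_SEPARATOR)


def split_tags(tags):
    """Split the space-separated notes.tags string, ignoring extra spaces."""
    return tags.split()


def label_fields(flds, field_names):
    """Pair each field value with its name, in model-definition order."""
    return dict(zip(field_names, split_fields(flds)))
```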

Table: cards

One row per card. Multiple cards can share the same parent note (e.g. forward and reverse cards).

Field | Type | Description
id | integer | Card creation timestamp (Unix milliseconds); primary key
nid | integer | Foreign key → parent note id
did | integer | Foreign key → deck id in col.decks
ord | integer | Template index (0-based) within the note type - identifies which card template generated this card
mod | integer | Last modification timestamp (Unix seconds)
usn | integer | Update sequence number for sync
type | integer | Card state: 0 = new, 1 = learning, 2 = review, 3 = relearning
queue | integer | Scheduling queue: -3 = user buried, -2 = sched buried, -1 = suspended, 0 = new, 1 = learning, 2 = review, 3 = day-learn, 4 = preview
due | integer | Due date; meaning varies: for new cards = position, for review = days since collection creation, for learning = Unix timestamp
ivl | integer | Current interval in days (negative value = seconds, used during learning steps)
factor | integer | Ease factor × 1000 (e.g. 2500 = 250% ease); used in legacy SM-2 scheduler
reps | integer | Total number of reviews
lapses | integer | Number of times the card was answered "Again" after graduation
left | integer | Remaining learning steps (encoded as a × 1000 + b: left % 1000 = steps left until graduation, left // 1000 = steps left today)
odue | integer | Original due date, used when card is in a filtered deck
odid | integer | Original deck ID, used when card is in a filtered deck
flags | integer | User-set colour flag (0–7)
data | text | JSON blob for extra scheduler data - FSRS stores stability s and difficulty d here, e.g. {"s":15.3,"d":6.8}
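
The sketch below reads the FSRS memory state out of cards.data and converts cards.factor to a plain multiplier. It assumes the blob uses the keys s and d as in the example above and may be empty for cards FSRS has never scheduled; both function names are hypothetical.

```python
import json


def parse_fsrs_state(data):
    """Return (stability, difficulty) from cards.data, or None if absent."""
    if not data:
        return None
    blob = json.loads(data)
    if "s" not in blob or "d" not in blob:
        return None
    return blob["s"], blob["d"]


def ease_multiplier(factor):
    """Convert cards.factor (ease x 1000) to a multiplier, e.g. 2500 -> 2.5."""
    return factor / 1000
```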

Table: revlog

One row per individual review event. This is the primary source of data for FSRS optimisation.

Field | Type | Description
id | integer | Review timestamp (Unix milliseconds); primary key
cid | integer | Foreign key → card id
usn | integer | Update sequence number for sync
ease | integer | Answer button pressed: 1 = Again, 2 = Hard, 3 = Good, 4 = Easy
ivl | integer | Interval after this review (days; negative = seconds)
lastIvl | integer | Interval before this review (days; negative = seconds)
factor | integer | Ease factor after this review × 1000
time | integer | Time taken to answer (milliseconds, capped at 60,000)
type | integer | Review type: 0 = learning, 1 = review, 2 = relearn, 3 = filtered, 4 = manual
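
Because ivl and lastIvl mix two units, any analysis first has to normalise them. A minimal sketch, with hypothetical names, that converts either column to seconds and labels the answer button:

```python
SECONDS_PER_DAY = 86_400

# Answer-button labels from the ease column described above.
EASE_LABELS = {1: "Again", 2: "Hard", 3: "Good", 4: "Easy"}


def interval_to_seconds(ivl):
    """Convert revlog.ivl or revlog.lastIvl to seconds.

    Positive values are days; negative values are seconds stored negated.
    """
    return -ivl if ivl < 0 else ivl * SECONDS_PER_DAY
```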

Table: graves

Tracks deleted objects (cards, notes, decks) for sync reconciliation. This table is not collected - see the exclusions section below.

Field | Type | Description
usn | integer | Update sequence number at time of deletion
oid | integer | Original ID of the deleted object
type | integer | Object type: 0 = card, 1 = note, 2 = deck

What is collected

We're collecting:

  • Card contents - to estimate population-level complexity and find lexical neighbors;
  • Anki review logs - to find memory patterns across neighbors and train a model that predicts how many reviews you'd realistically need to master a given word (the prediction target may change as the research progresses);
  • Languages you're proficient in (C1+) - L1 and other known languages transfer knowledge into new ones;
  • Your interests or domain - this affects your prior exposure, which shifts your "similarity" to certain words (e.g., as an AI researcher I'm more familiar with technical terms than biological ones);
  • Consent to include your submitted data in the final public dataset. There's very little research in this area precisely because no large public datasets exist on how personalization changes lexical complexity. A public dataset reflecting how memory responds to different vocabulary types would substantially advance the field.

How your data is handled

Fields fall into three buckets: collected as-is, collected with transformation, or excluded entirely. Reasoning is grouped by category rather than enumerated per field, since most exclusions share a single rationale.

Excluded entirely

The following fields carry no research signal, can act as quasi-identifiers (for example, sync timestamps cluster by user device), or both, which would make it possible to track patterns unrelated to language learning. We therefore remove these fields before submission and never collect them.

Sync metadata - only meaningful for AnkiWeb reconciliation.

  • col.usn, col.ls, col.mod
  • notes.usn, notes.mod
  • cards.usn, cards.mod
  • revlog.usn
  • The entire graves table

Internal deduplication identifiers - used by Anki's import/sync logic.

  • notes.guid (10-character globally unique identifier)
  • notes.csum (checksum of the first field)

User personal annotations - colour flags are user-private categorical signals (e.g. "card I'm anxious about"), and the add-on data blob has unknown structure and may contain anything an installed add-on chose to write.

  • cards.flags, notes.flags (user-set colour flags)
  • notes.data (arbitrary add-on data blob)

Redundant data

  • col.tags (global tag registry - already captured per-note via notes.tags after filtering). See below on further tag processing.

Note-type styling and templates - CSS and card templates can encode user customisation patterns and occasionally contain identifying comments.

  • col.models is reduced to field definitions only (field names, ordering); CSS, card templates, and template names are dropped before submission.
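
A minimal sketch of this reduction is shown below. It assumes col.models is a JSON object keyed by note type ID in which each note type carries a flds list with name and ord entries, alongside keys such as tmpls and css that get dropped; these key names and the function name reduce_models are our assumptions, not a guaranteed schema.

```python
import json


def reduce_models(models_json):
    """Keep only field names and ordering for every note type."""
    reduced = {}
    for model_id, model in json.loads(models_json).items():
        fields = sorted(model.get("flds", []), key=lambda f: f.get("ord", 0))
        reduced[model_id] = {
            # Everything except name and ord (fonts, templates, CSS) is dropped.
            "flds": [
                {"name": f["name"], "ord": f.get("ord", index)}
                for index, f in enumerate(fields)
            ]
        }
    return reduced
```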

Collected with transformation

Timestamps - with per-user random offset. Every timestamp in a single user's submission is offset by the same random constant drawn at submission time. This preserves all inter-event intervals - the only temporal signal SRS uses - while destroying alignment with real wall-clock time. This stage eliminates inference of sleep schedule, work hours, timezone, and weekday patterns from review timing.

  • Applies to: revlog.id, col.crt, notes.id, cards.id
  • The offset is not stored.

This transformation is applied after submission but before the final dataset is released.
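
The offsetting step can be sketched as follows. The bound MAX_OFFSET_S and the function names are hypothetical; the point is that one constant is drawn per submission and added to every timestamp, so differences between events are unchanged. The offset is drawn in whole seconds so that second-resolution values (col.crt) and millisecond-resolution values (revlog.id, notes.id, cards.id) stay aligned.

```python
import secrets

MAX_OFFSET_S = 10 * 365 * 24 * 60 * 60  # hypothetical bound: about 10 years


def draw_offset_seconds():
    """Draw one random offset, in whole seconds, for an entire submission."""
    return secrets.randbelow(2 * MAX_OFFSET_S + 1) - MAX_OFFSET_S


def shift_milliseconds(timestamps_ms, offset_s):
    """Shift millisecond timestamps (e.g. revlog.id) by the shared offset."""
    return [t + offset_s * 1000 for t in timestamps_ms]


def shift_seconds(timestamps_s, offset_s):
    """Shift second timestamps (e.g. col.crt) by the same offset."""
    return [t + offset_s for t in timestamps_s]
```

Discarding the value returned by draw_offset_seconds after use is what makes the shift irreversible.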

Identifiers - synthetic per-user counters. Real Anki IDs (note, card, deck, model, review) are replaced with synthetic counters during processing. This means that a future Anki export from the same user cannot be cross-referenced against published submissions.

  • Applies to: notes.id, cards.id, cards.nid, cards.did, revlog.cid, notes.mid
  • The mapping from real to synthetic IDs is not retained.

This transformation is applied after submission but before the final dataset is released.
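
One way to picture the remapping, assuming rows are held as dictionaries keyed by column name (the helper names are hypothetical): build one counter-based map per ID space, rewrite both primary and foreign keys with it, then discard the map.

```python
def build_id_map(real_ids):
    """Assign 1, 2, 3, ... to real IDs in first-seen order."""
    mapping = {}
    for real_id in real_ids:
        if real_id not in mapping:
            mapping[real_id] = len(mapping) + 1
    return mapping


def remap_rows(rows, column, mapping):
    """Return copies of rows with one ID column replaced by synthetic IDs."""
    return [{**row, column: mapping[row[column]]} for row in rows]
```

For example, the map built from notes.id would be applied to both notes.id and cards.nid so that the foreign key still resolves.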

Tags. Tags may be either linguistically useful (noun, irregular_verb, feminine) or idiosyncratic (for_friday_test, names). We provide a way to remove any tags you don't want included in the final dataset.

  • Applies to: notes.tags
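
Tag removal amounts to filtering the space-separated notes.tags string. A minimal sketch with a hypothetical function name:

```python
def filter_tags(tags, removed):
    """Drop the tags the participant chose to remove; keep the rest in order."""
    removed_set = set(removed)
    return " ".join(tag for tag in tags.split() if tag not in removed_set)
```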

Deck names - replaced with neutral identifiers. User-set deck names (which can leak topic, purpose, or personal context - e.g. "German for Anna's wedding") are replaced with neutral identifiers. Hierarchy is preserved; semantic content is not.

  • Applies to: col.decks → name
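
A sketch of hierarchy-preserving renaming follows, assuming nested deck names use the "::" separator (e.g. German::Verbs). The deck_N label format and the function name are hypothetical. Each distinct path prefix gets its own label, so two sub-decks that share a name under different parents stay distinct.

```python
def neutralise_deck_names(names):
    """Map each deck name to a neutral one, keeping the '::' hierarchy."""
    prefix_ids = {}
    result = {}
    for name in sorted(names):  # sorted only to make the output deterministic
        parts = name.split("::")
        neutral = []
        for depth in range(1, len(parts) + 1):
            prefix = "::".join(parts[:depth])
            if prefix not in prefix_ids:
                prefix_ids[prefix] = f"deck_{len(prefix_ids) + 1}"
            neutral.append(prefix_ids[prefix])
        result[name] = "::".join(neutral)
    return result
```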

Card content - passes through your review before submission. Before any data leaves your machine, you see a card-by-card preview of notes.flds content with a per-card include/exclude toggle. Excluded cards are removed from the submission entirely along with their review history and scheduling state. This is in-the-loop consent, not post-hoc redaction. We never see what you exclude.

  • Applies to: notes.flds, notes.sfld
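
The cascade from an exclusion to review history and scheduling state can be sketched as below. We assume here that an exclusion is recorded per note and removes every card generated from it; rows are dictionaries keyed by column name, and the function name is hypothetical.

```python
def apply_exclusions(notes, cards, revlog, excluded_note_ids):
    """Drop excluded notes together with their cards and review events."""
    excluded = set(excluded_note_ids)
    kept_notes = [n for n in notes if n["id"] not in excluded]
    kept_cards = [c for c in cards if c["nid"] not in excluded]
    kept_card_ids = {c["id"] for c in kept_cards}
    # A review event survives only if its card survived.
    kept_revlog = [r for r in revlog if r["cid"] in kept_card_ids]
    return kept_notes, kept_cards, kept_revlog
```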

Collected as-is

  • Scheduling state: cards.type, cards.queue, cards.due, cards.ivl, cards.factor, cards.reps, cards.lapses, cards.left, cards.ord, cards.odue, cards.odid
  • FSRS per-card memory state: cards.data (stability s, difficulty d)
  • Review events: revlog.ease, revlog.ivl, revlog.lastIvl, revlog.factor, revlog.time, revlog.type
  • FSRS configuration per deck: col.dconf → fsrsWeights, desiredRetention, ignoreRevlogsBeforeDate, newPerDay, maxIvl
  • Note-type structure: col.models → field definitions only (names, ordering)
  • Collection-level scheduler config: col.conf → scheduler version, timezone offset (rounded to nearest hour), sort field
  • Schema version: col.ver

Additional strategies

1. Automated PII pattern scanner

The pre-submission UI runs a local pattern scanner over card content and flags potential PII for your attention:

  • Email addresses;
  • Phone numbers (multiple formats and country conventions);
  • Long digit sequences (potential ID numbers, card numbers, dates of birth);
  • URLs;
  • Common given-name and surname lists for the languages being studied.

Flagged cards surface in the review UI with the trigger highlighted. You can decide whether to include or exclude.
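
To give a feel for what such a scanner does, here is a deliberately simplified sketch covering three of the categories above. The regular expressions are illustrative assumptions, not the patterns the scanner actually ships with; phone-number formats and name lists are omitted.

```python
import re

# Illustrative patterns only; real-world PII detection needs broader rules.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "url": re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE),
    "long_digits": re.compile(r"\d{6,}"),
}


def scan_pii(text):
    """Return (kind, start, end, matched_text) tuples, sorted by position."""
    hits = []
    for kind, pattern in PII_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((kind, match.start(), match.end(), match.group()))
    return sorted(hits, key=lambda hit: hit[1])
```

The start and end offsets are what a review UI would use to highlight the trigger inside the card text.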

2. Special-category content check

Card content is additionally scanned locally for terms in health, religious, and political categories. When matches are found above a small threshold, you're shown an additional explicit notice before submission, since under GDPR Article 9 these are special categories of personal data requiring more explicit consent.
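
A minimal sketch of a threshold check of this kind is shown below. The term list, the default threshold of 3, and the choice to count matching cards rather than individual word matches are all assumptions made for illustration.

```python
import re


def count_term_hits(card_texts, terms):
    """Count cards containing at least one listed term (case-insensitive)."""
    lowered_terms = {term.lower() for term in terms}
    hits = 0
    for text in card_texts:
        words = set(re.findall(r"\w+", text.lower()))
        if words & lowered_terms:
            hits += 1
    return hits


def needs_special_notice(card_texts, terms, threshold=3):
    """True when the number of matching cards exceeds the threshold."""
    return count_term_hits(card_texts, terms) > threshold
```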

3. Withdrawal mechanism

Each submission generates a one-way submission token retained by the participant. Submitting the token to a withdrawal endpoint removes the contribution from the working dataset and, where the public release has not yet occurred, from the eventual release. Withdrawal requests are processed on a scheduled basis. Withdrawal after the public release, which is scheduled for October 2026, is not supported.
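
One common way to build a token of this kind, shown purely as an assumption about how "one-way" could be realised, is to hand the participant a random token and keep only its hash alongside the submission. The function names and the choice of SHA-256 are hypothetical.

```python
import hashlib
import hmac
import secrets


def new_submission_token():
    """Return (token for the participant, digest to store with the submission)."""
    token = secrets.token_urlsafe(32)
    digest = hashlib.sha256(token.encode("utf-8")).hexdigest()
    return token, digest


def token_matches(token, stored_digest):
    """Check a presented token against the stored digest in constant time."""
    digest = hashlib.sha256(token.encode("utf-8")).hexdigest()
    return hmac.compare_digest(digest, stored_digest)
```

Under this scheme the stored digest cannot be turned back into the token, so only the participant can trigger a withdrawal.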