Last updated 12 May 2026
An .apkg file is a ZIP archive containing an SQLite database and associated media files. The database (named collection.anki21b, collection.anki21, or collection.anki2 depending on the Anki version) holds all card data, review history, scheduling parameters, and configuration. This document describes every table and field exposed by that database, and specifies which fields are collected, excluded, or transformed before inclusion in the research dataset.
The dataset is intended for public release after the research is completed. The collection decisions and mitigations below are designed against that bar.
Each .apkg file exported from Anki contains the following files.
| File | Description |
|---|---|
| collection.anki21b | Main SQLite database (modern format) |
| meta | Protobuf-encoded format version metadata |
| media | JSON mapping of numbered filenames to original media filenames |
| 0, 1, 2, … | Actual media files (images, audio, etc.) |
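The archive layout above can be inspected with the standard library alone. The following is a minimal sketch, not the project's actual tooling: the helper names `find_collection_member` and `media_members` are hypothetical, and the code only lists archive members without attempting to open the database itself.

```python
import zipfile
from typing import List, Optional

# Database names as listed in this document, newest format first.
COLLECTION_NAMES = ("collection.anki21b", "collection.anki21", "collection.anki2")


def find_collection_member(apkg: zipfile.ZipFile) -> Optional[str]:
    """Return the newest collection database name present in the archive."""
    members = set(apkg.namelist())
    for name in COLLECTION_NAMES:
        if name in members:
            return name
    return None


def media_members(apkg: zipfile.ZipFile) -> List[str]:
    """Return the numbered media entries (0, 1, 2, ...) in numeric order."""
    return sorted((n for n in apkg.namelist() if n.isdigit()), key=int)
```

An archive can contain more than one collection file, so the sketch prefers the newest name it finds.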
Below is an overview of what is stored in the collection.anki21b SQLite database.
col

A single-row table holding collection-level configuration. All deck presets, note types, and global settings are stored here as JSON blobs.
| Field | Type | Description |
|---|---|---|
| id | integer | Always 1 - only one row exists |
| crt | integer | Collection creation timestamp (Unix seconds) |
| mod | integer | Last modification timestamp (Unix milliseconds) |
| scm | integer | Schema modification timestamp |
| ver | integer | Schema version number |
| usn | integer | Update sequence number, used for AnkiWeb sync |
| ls | integer | Last sync timestamp |
| conf | JSON text | Global configuration: scheduler version, timezone offset, sort field, sort order, new card insertion order, etc. |
| models | JSON text | All note types (models): field definitions, card templates, CSS styling |
| decks | JSON text | All deck definitions: names, parent/child hierarchy, per-deck limits |
| dconf | JSON text | Deck configuration presets - contains FSRS weights (fsrsWeights, desiredRetention, ignoreRevlogsBeforeDate) and legacy SM-2 ease settings |
| tags | JSON text | Registry of all tags used across the collection |
dconf sub-fields (FSRS-relevant)

| Sub-field | Description |
|---|---|
| fsrsWeights | Array of 17-19 floats representing the optimised FSRS model parameters |
| desiredRetention | Target retention rate, e.g. 0.9 for 90% |
| ignoreRevlogsBeforeDate | ISO date string - review logs before this date are excluded from FSRS optimisation |
| newPerDay | Maximum new cards per day |
| maxIvl | Maximum interval in days |
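As an illustration of how these sub-fields might be pulled out of the dconf blob, here is a minimal sketch. It assumes that dconf decodes to a JSON object keyed by preset ID and that each preset exposes the sub-fields above as top-level keys; both are assumptions made for the example, and real exports may nest or name these values differently. The function name `extract_preset_fields` is hypothetical.

```python
import json

# Sub-field names as listed in the table above.
FSRS_FIELDS = (
    "fsrsWeights",
    "desiredRetention",
    "ignoreRevlogsBeforeDate",
    "newPerDay",
    "maxIvl",
)


def extract_preset_fields(dconf_json: str) -> dict:
    """Keep only the FSRS-relevant sub-fields of every preset.

    Assumes a JSON object of the form {preset_id: {sub_field: value, ...}}.
    Sub-fields missing from a preset are simply omitted.
    """
    presets = json.loads(dconf_json)
    return {
        preset_id: {key: preset[key] for key in FSRS_FIELDS if key in preset}
        for preset_id, preset in presets.items()
    }
```

Everything not named in `FSRS_FIELDS`, such as a preset's display name, is dropped by construction.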
notes

One row per note. A note is the source of content; one note can generate multiple cards via its template.
| Field | Type | Description |
|---|---|---|
| id | integer | Note creation timestamp (Unix milliseconds); serves as primary key |
| guid | text | Globally unique 10-character identifier, used to prevent duplicates on import |
| mid | integer | Foreign key → note type ID in col.models |
| mod | integer | Last modification timestamp (Unix seconds) |
| usn | integer | Update sequence number for sync |
| tags | text | Space-separated list of tags (e.g. biology anatomy) |
| flds | text | All field values concatenated, separated by the \x1f (ASCII unit separator) character - field order matches the model definition in col.models |
| sfld | text | Value of the sort field (used for browser sorting) |
| csum | integer | Numeric checksum of the first field, used for duplicate detection |
| flags | integer | Reserved flags |
| data | text | Extra data blob, used by add-ons |
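The flds and tags encodings described above can be unpacked in a few lines. This is a minimal sketch; the function name `split_note` is hypothetical, and `field_names` stands in for the ordered field names taken from the note type definition.

```python
FIELD_SEPARATOR = "\x1f"  # ASCII unit separator between field values


def split_note(flds: str, tags: str, field_names: list) -> tuple:
    """Split a notes row into a {field name: value} dict and a tag list.

    field_names must be in the same order as the note type definition,
    because flds carries values only, not names.
    """
    values = flds.split(FIELD_SEPARATOR)
    fields = dict(zip(field_names, values))
    # str.split() with no argument also discards leading/trailing spaces.
    return fields, tags.split()
```

For example, `split_note("Hund\x1fdog", " biology anatomy ", ["Front", "Back"])` yields the two named fields and the two tags.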
cards

One row per card. Multiple cards can share the same parent note (e.g. forward and reverse cards).
| Field | Type | Description |
|---|---|---|
| id | integer | Card creation timestamp (Unix milliseconds); primary key |
| nid | integer | Foreign key → parent note id |
| did | integer | Foreign key → deck id in col.decks |
| ord | integer | Template index (0-based) within the note type - identifies which card template generated this card |
| mod | integer | Last modification timestamp (Unix seconds) |
| usn | integer | Update sequence number for sync |
| type | integer | Card state: 0 = new, 1 = learning, 2 = review, 3 = relearning |
| queue | integer | Scheduling queue: -3 = sched buried, -2 = user buried, -1 = suspended, 0 = new, 1 = learning, 2 = review, 3 = day-learn, 4 = preview |
| due | integer | Due date; meaning varies: for new cards = position, for review = days since collection creation, for learning = Unix timestamp |
| ivl | integer | Current interval in days (negative value = seconds, used during learning steps) |
| factor | integer | Ease factor × 1000 (e.g. 2500 = 250% ease); used in legacy SM-2 scheduler |
| reps | integer | Total number of reviews |
| lapses | integer | Number of times the card was answered "Again" after graduation |
| left | integer | Remaining learning steps, encoded as a × 1000 + b: a (left // 1000) = reps left today, b (left % 1000) = reps left until graduation |
| odue | integer | Original due date, used when card is in a filtered deck |
| odid | integer | Original deck ID, used when card is in a filtered deck |
| flags | integer | User-set colour flag (0–7) |
| data | text | JSON blob for extra scheduler data - FSRS stores stability s and difficulty d here, e.g. {"s":15.3,"d":6.8} |
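Reading the FSRS state out of cards.data and converting the stored ease factor can be sketched as follows. The helper names are hypothetical, and the code assumes only what the table states: an optional JSON object with keys s and d, and a factor stored as ease × 1000.

```python
import json


def decode_fsrs_state(data: str) -> tuple:
    """Return (stability, difficulty) from cards.data.

    Returns (None, None) when the blob is empty, is not valid JSON,
    or is not a JSON object; a missing key yields None for that value.
    """
    if not data:
        return None, None
    try:
        blob = json.loads(data)
    except json.JSONDecodeError:
        return None, None
    if not isinstance(blob, dict):
        return None, None
    return blob.get("s"), blob.get("d")


def ease_multiplier(factor: int) -> float:
    """Convert the stored factor (ease x 1000) to a multiplier, e.g. 2500 -> 2.5."""
    return factor / 1000
```

Returning `None` rather than raising keeps cards that were never scheduled by FSRS usable in the same pipeline.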
revlog

One row per individual review event. This is the primary source of data for FSRS optimisation.
| Field | Type | Description |
|---|---|---|
| id | integer | Review timestamp (Unix milliseconds); primary key |
| cid | integer | Foreign key → card id |
| usn | integer | Update sequence number for sync |
| ease | integer | Answer button pressed: 1 = Again, 2 = Hard, 3 = Good, 4 = Easy |
| ivl | integer | Interval after this review (days; negative = seconds) |
| lastIvl | integer | Interval before this review (days; negative = seconds) |
| factor | integer | Ease factor after this review × 1000 |
| time | integer | Time taken to answer (milliseconds, capped at 60,000) |
| type | integer | Review type: 0 = learning, 1 = review, 2 = relearn, 3 = filtered, 4 = manual |
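Because ivl and lastIvl mix two units in one column, it can help to normalise them before analysis. The sketch below uses hypothetical helper names and relies only on the encodings in the table: positive intervals are days, negative intervals are seconds, and id is a Unix timestamp in milliseconds.

```python
from datetime import datetime, timezone

SECONDS_PER_DAY = 86_400


def interval_to_seconds(ivl: int) -> int:
    """Normalise a revlog interval to seconds.

    Non-negative values are days; negative values are seconds.
    """
    return ivl * SECONDS_PER_DAY if ivl >= 0 else -ivl


def review_time(revlog_id: int) -> datetime:
    """Interpret revlog.id (Unix milliseconds) as a UTC datetime."""
    return datetime.fromtimestamp(revlog_id / 1000, tz=timezone.utc)
```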
graves

Tracks deleted objects (cards, notes, decks) for sync reconciliation. This table is not collected - see the exclusions section below.
| Field | Type | Description |
|---|---|---|
| usn | integer | Update sequence number at time of deletion |
| oid | integer | Original ID of the deleted object |
| type | integer | Object type: 0 = card, 1 = note, 2 = deck |
We're collecting:
Fields fall into three buckets: collected as-is, collected with transformation, or excluded entirely. Reasoning is grouped by category rather than enumerated per field, since most exclusions share a single rationale.
The following fields carry no research signal, can act as quasi-identifiers (for example, sync timestamps cluster by user device), or both, which could make it possible to track patterns unrelated to language learning. We therefore remove these fields before submission and never collect them.
Sync metadata - only meaningful for AnkiWeb reconciliation.
- col.usn, col.ls, col.mod
- notes.usn, notes.mod
- cards.usn, cards.mod
- revlog.usn
- graves table

Internal deduplication identifiers - used by Anki's import/sync logic.
- notes.guid (10-character globally unique identifier)
- notes.csum (checksum of the first field)

User personal annotations - colour flags are user-private categorical signals (e.g. "card I'm anxious about"), and the add-on data blob has unknown structure and may contain anything an installed add-on chose to write.
- cards.flags, notes.flags (user-set colour flags)
- notes.data (arbitrary add-on data blob)

Redundant data
- col.tags (global tag registry - already captured per-note via notes.tags after filtering). See below on further tag processing.

Note-type styling and templates - CSS and card templates can encode user customisation patterns and occasionally contain identifying comments.
- col.models is reduced to field definitions only (field names, ordering); CSS, card templates, and template names are dropped before submission.

Timestamps - with per-user random offset. Every timestamp in a single user's submission is offset by the same random constant drawn at submission time. This preserves all inter-event intervals - the only temporal signal SRS uses - while destroying alignment with real wall-clock time. This stage eliminates inference of sleep schedule, work hours, timezone, and weekday patterns from review timing.
- revlog.id, col.crt, notes.id, cards.id

Applied post submission, but before the final dataset is released.
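The constant-offset idea can be sketched as below. This is an illustration under stated assumptions, not the project's implementation: the bound `MAX_OFFSET_SECONDS` and all function names are hypothetical, and the offset is drawn in whole seconds so that the same shift applies exactly to both second-resolution fields (col.crt) and millisecond-resolution fields (revlog.id, notes.id, cards.id).

```python
import secrets

# Hypothetical bound on the offset magnitude; the real range is not specified here.
MAX_OFFSET_SECONDS = 5 * 365 * 86_400


def draw_offset_seconds() -> int:
    """Draw one offset per user, uniform in [-MAX, +MAX] whole seconds."""
    return secrets.randbelow(2 * MAX_OFFSET_SECONDS + 1) - MAX_OFFSET_SECONDS


def shift_ms(timestamps_ms: list, offset_seconds: int) -> list:
    """Shift millisecond timestamps by the user's offset."""
    return [t + offset_seconds * 1000 for t in timestamps_ms]


def shift_seconds(timestamp_s: int, offset_seconds: int) -> int:
    """Shift a second-resolution timestamp by the same offset."""
    return timestamp_s + offset_seconds
```

Because every value moves by the same amount, the difference between any two shifted timestamps equals the difference between the originals.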
Identifiers - synthetic per-user counters. Real Anki IDs (note, card, deck, model, review) are replaced with synthetic counters during processing. This means that a future Anki export from the same user cannot be cross-referenced against published submissions.
- notes.id, cards.id, cards.nid, cards.did, revlog.cid, notes.mid

Applied post submission, but before the final dataset is released.
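One way to picture the replacement of real IDs with synthetic counters is a per-user lookup table for each kind of object. The class below is a hypothetical sketch, not the actual processing code; the key property is that a foreign key (for example cards.nid) is remapped through the same table as the primary key it points to (notes.id), so relationships survive.

```python
class IdRemapper:
    """Assign per-user synthetic counters to real Anki IDs, one table per kind."""

    def __init__(self) -> None:
        self._tables = {}

    def remap(self, kind: str, real_id: int) -> int:
        """Return the synthetic ID for real_id, allocating the next counter if new."""
        table = self._tables.setdefault(kind, {})
        if real_id not in table:
            table[real_id] = len(table) + 1
        return table[real_id]


# Usage: the note's primary key and the card's foreign key share the "note" table.
remapper = IdRemapper()
note_id = remapper.remap("note", 1690000000000)
card_nid = remapper.remap("note", 1690000000000)
```

The lookup tables would need to be discarded after processing, since they are the only link between synthetic and real IDs.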
Tags. Tags may be either linguistically useful (noun, irregular_verb, feminine) or idiosyncratic (for_friday_test, names). We provide a way to remove any tags that you don't want included in the final dataset.
- notes.tags

Deck names - replaced with neutral identifiers. User-set deck names (which can leak topic, purpose, or personal context - e.g. "German for Anna's wedding") are replaced with neutral identifiers. Hierarchy is preserved; semantic content is not.
- col.decks → name

Card content - passes through your review before submission. Before any data leaves your machine, you see a card-by-card preview of notes.flds content with a per-card include/exclude toggle. Excluded cards are removed from the submission entirely along with their review history and scheduling state. This is in-the-loop consent, not post-hoc redaction. We never see what you exclude.
- notes.flds, notes.sfld

Collected as-is:

- cards.type, cards.queue, cards.due, cards.ivl, cards.factor, cards.reps, cards.lapses, cards.left, cards.ord, cards.odue, cards.odid
- cards.data (stability s, difficulty d)
- revlog.ease, revlog.ivl, revlog.lastIvl, revlog.factor, revlog.time, revlog.type
- col.dconf → fsrsWeights, desiredRetention, ignoreRevlogsBeforeDate, newPerDay, maxIvl
- col.models → field definitions only (names, ordering)
- col.conf → scheduler version, timezone offset (rounded to nearest hour), sort field
- col.ver

The pre-submission UI runs a local pattern scanner over card content and flags potential PII for your attention.
Flagged cards surface in the review UI with the trigger highlighted, and you decide whether to include or exclude each one.
Card content is additionally scanned locally for terms in health, religious, and political categories. When matches are found above a small threshold, you're shown an additional explicit notice before submission, since under GDPR Article 9 these are special categories of personal data requiring more explicit consent.
Each submission generates a one-way submission token that you retain. Submitting the token to a withdrawal endpoint removes your contribution from the working dataset and, where the public release has not yet occurred, from the eventual release. Withdrawal requests are processed on a scheduled basis. The public release is scheduled for October 2026; withdrawal after that release is not supported.
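This document does not specify how the token is built. One common way to make a token one-way, shown here purely as an assumption, is to hand the participant a random secret and keep only its hash alongside the submission, so the stored value cannot be turned back into the token. All names below are hypothetical.

```python
import hashlib
import hmac
import secrets


def new_submission_token() -> tuple:
    """Return (token, digest): the token goes to the participant, the digest is stored."""
    token = secrets.token_urlsafe(32)
    digest = hashlib.sha256(token.encode("utf-8")).hexdigest()
    return token, digest


def token_matches(token: str, stored_digest: str) -> bool:
    """Check a presented token against the stored digest in constant time."""
    candidate = hashlib.sha256(token.encode("utf-8")).hexdigest()
    return hmac.compare_digest(candidate, stored_digest)
```

Under this scheme a lost token cannot be recovered, which matches the idea that only the participant retains it.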