
LVTag Specification
Version 1.0
Created by Danslav Slavenskoj
Date: May 2025
Quick Links
- JSON Schema - Full validation schema for LVTag format
- Classifier Definitions - Machine-readable classifier specifications
- Specification - Jump to format details
- Examples - See LVTag in action
Overview
The Language Variant Tag (LVTag) format is a systematic approach to language classification that extends the BCP 47 standard using private-use subtags. It enables precise identification of language varieties across multiple dimensions including formality, politeness, domain, and orthography.
Key Benefits
Classification Rigor: LVTag brings systematic organization to language tagging by providing clear, separate dimensions for different types of variation. Unlike existing subtags and systems that mix different categories at the same level, LVTag maintains strict separation between formality, politeness, domain, and other dimensions.
Standards Compatibility: LVTag is fully compliant with BCP 47 (RFC 5646) and works seamlessly with:
- IANA Language Subtag Registry
- ISO 639 language codes
- Unicode CLDR
- W3C language tags
- HTTP Accept-Language headers
- XML lang attributes
- HTML lang attributes
Technology Integration: LVTag tags can be used directly in:
- Natural Language Processing (NLP) pipelines
- Machine Translation systems
- Content Management Systems (CMS)
- Language detection libraries
- Search engines and information retrieval
- Web applications and APIs
- Localization workflows
Use Cases:
- Audience Targeting: Match content to appropriate audiences based on register and domain
- Translation Quality: Maintain appropriate formality and politeness levels in machine translation
- Language Learning: Teach learners appropriate register for different contexts
- Corpus Linguistics: Build precisely tagged corpora for research
- Social Media Analysis: Classify user-generated content by register and domain
- Customer Service: Route messages based on formality and domain to appropriate agents
Rationale
While BCP 47 provides excellent support for identifying languages, scripts, and regions, it lacks standardized mechanisms for capturing sociolinguistic variation within a language. Current standards don’t address:
- Register Variation: No way to distinguish between formal and informal varieties of the same language
- Politeness Levels: Critical for languages like Japanese, Korean, and Thai where politeness is grammatically encoded
- Domain-Specific Language: No standard for marking technical, medical, or legal language varieties
- Sociolects: No mechanism for identifying social group varieties (youth language, professional jargon)
- Historical Stages: Limited support for distinguishing classical from modern forms
- Formality Gradients: No numeric scale for computational processing of register
- Proto-Languages: Inconsistent encoding: some proto-languages can only be approximated by an ISO 639-5 family code (e.g., ine, which covers all Indo-European languages rather than PIE itself) while others have no code at all, creating a confusing landscape for historical linguistics
- Orthographic Variation: While BCP 47 handles scripts, it doesn’t effectively capture variations within scripts (spelling reforms, romanization systems, competing standards) that fundamentally affect text processing, search, and spell-checking
LVTag fills these gaps using BCP 47’s private-use extension mechanism (-x-), providing a systematic, machine-readable way to encode these critical dimensions of language variation while maintaining full backward compatibility.
Precise Language Classification
The advent of large language models and sophisticated NLP tools has made precise language variety classification not just useful but essential. Modern systems need to:
- Generate text appropriate to specific contexts (formal vs. informal, polite vs. casual)
- Train on properly classified corpora to avoid mixing registers inappropriately
- Provide culturally and contextually appropriate responses
- Handle code-switching and mixed-language content accurately
- Preserve stylistic consistency when translating or transforming text
- Filter training data based on formality, domain, or other characteristics
- Adapt output to match user preferences or requirements
LVTag provides the granular metadata needed to understand not just what language is being used, but how it’s being used, enabling more nuanced and appropriate language processing pipelines.
Format Specification
Basic Structure
language-x-[classifier]-[value]-[classifier2]-[value2]...
Where:
- language is a valid BCP 47 primary language subtag (e.g., en, ko, ja)
- x indicates the beginning of private-use subtags
- classifier is a category identifier (see Magic Tags below)
- value is the specific classification within that category
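The structure above can be sketched as a small parser. This is only an illustration, not a reference implementation: the function name `split_lvtag` is hypothetical, and the sketch assumes that everything after `x` is a sequence of classifier/value pairs.

```python
def split_lvtag(tag: str) -> tuple[str, list[tuple[str, str]]]:
    """Split a tag into its base language tag and (classifier, value) pairs."""
    subtags = tag.lower().split("-")      # BCP 47 tags are case-insensitive
    if "x" not in subtags:
        return "-".join(subtags), []      # ordinary tag with no LVTag section
    x_index = subtags.index("x")
    base = "-".join(subtags[:x_index])    # empty for tags such as x-proto-ine
    rest = subtags[x_index + 1:]
    if not rest or len(rest) % 2 != 0:
        raise ValueError(f"classifier without a value in {tag!r}")
    # Classifiers sit at even offsets after "x", values at odd offsets.
    return base, list(zip(rest[0::2], rest[1::2]))


print(split_lvtag("ko-x-form-4-domain-business"))
# ('ko', [('form', '4'), ('domain', 'business')])
```

Because a single-character x subtag can only introduce the private-use section, looking for the first x subtag is enough to find where the LVTag part begins.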
Magic Tags
LVTag supports both long-form and short-form “magic” classifiers for flexibility:
Long Form | Short Form | Description |
---|---|---|
ortho | w | Orthographic variant |
form | f | Formality level (1-5 scale) |
polite | p | Politeness/respect level (1-5 scale) |
domain | d | Specialized vocabulary or professional context |
geo | g | Geographic or regional variety |
proto | a | Proto-language or reconstructed language |
hist | h | Historical period or stage of a language |
genre | e | Text genre or literary style |
medium | m | Communication medium (spoken, written, digital) |
socio | s | Sociolect or social group variety |
modality | o | Mode of language production |
register | r | Linguistic register |
pragma | u | Communicative function |
temporal | t | Temporal marking |
evidence | v | Information source |
affect | k | Emotional tone |
age | n | Age/generation variety |
gender | i | Gender variety |
expert | b | Expertise level |
interact | 2 | Interactional structure |
prosody | y | Prosodic features |
lexical | l | Lexical density (0-100) |
syntax | z | Syntactic complexity (0-100) |
start | 0 | Start date (ISO 8601 without punctuation) |
end | 1 | End date (ISO 8601 without punctuation) |
taboo | j | Taboo/vulgar content level (0-5 scale) |
conf | c | Confidence score (0-100) for previous tag |
(none) | q, 3-9 | Reserved for future use |
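As a rough illustration of how the two forms relate, the table can be turned into a lookup that converts a tag between long and short classifiers. The names `LONG_TO_SHORT`, `to_short_form`, and `to_long_form` are hypothetical, and the sketch assumes classifiers and values strictly alternate after `x`.

```python
# Long-form to short-form classifiers, copied from the Magic Tags table.
LONG_TO_SHORT = {
    "ortho": "w", "form": "f", "polite": "p", "domain": "d", "geo": "g",
    "proto": "a", "hist": "h", "genre": "e", "medium": "m", "socio": "s",
    "modality": "o", "register": "r", "pragma": "u", "temporal": "t",
    "evidence": "v", "affect": "k", "age": "n", "gender": "i", "expert": "b",
    "interact": "2", "prosody": "y", "lexical": "l", "syntax": "z",
    "start": "0", "end": "1", "taboo": "j", "conf": "c",
}
SHORT_TO_LONG = {short: long for long, short in LONG_TO_SHORT.items()}


def _convert(tag: str, table: dict[str, str]) -> str:
    subtags = tag.lower().split("-")
    if "x" not in subtags:
        return "-".join(subtags)
    first = subtags.index("x") + 1
    # Only classifier positions (every second subtag) are rewritten, so a
    # value such as "2" is never mistaken for the short classifier "2".
    for i in range(first, len(subtags), 2):
        subtags[i] = table.get(subtags[i], subtags[i])
    return "-".join(subtags)


def to_short_form(tag: str) -> str:
    return _convert(tag, LONG_TO_SHORT)


def to_long_form(tag: str) -> str:
    return _convert(tag, SHORT_TO_LONG)


print(to_short_form("ko-x-form-4-polite-2-domain-business"))  # ko-x-f-4-p-2-d-business
print(to_long_form("ja-x-f-5-j-4"))                           # ja-x-form-5-taboo-4
```

Unknown classifiers are passed through unchanged here; a stricter tool might reject them instead.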
Classifiers
1. Orthography Classifier (ortho or w)
Identifies specific orthographic conventions or writing system variants beyond standard script tags.
Format:
- Long: language-x-ortho-[variant]
- Short: language-x-w-[variant]
Examples (combined with standard script tags):
- az-Latn-x-ortho-new or az-Latn-x-w-new - Azerbaijani Latin script, new orthography
- de-Latn-x-ortho-1901 or de-Latn-x-w-1901 - German Latin script, 1901 orthography
- zh-Hans-x-ortho-pinyin or zh-Hans-x-w-pinyin - Simplified Chinese with Pinyin
- yi-Hebr-x-ortho-yivo or yi-Hebr-x-w-yivo - Yiddish Hebrew script, YIVO orthography
2. Formality Classifier (form or f)
Identifies the formality level of language use.
Format:
- Long: language-x-form-[1-5]
- Short: language-x-f-[1-5]
Formality scale:
- 1 = Most formal (written documents, official speeches)
- 2 = Formal (business meetings, academic writing)
- 3 = Neutral/standard (news, general conversation)
- 4 = Informal (casual conversation, emails to friends)
- 5 = Most casual (intimate conversation, slang)
Examples:
- ko-x-form-1 or ko-x-f-1 - Most formal Korean
- en-x-form-3 or en-x-f-3 - Neutral English
- ja-x-form-5 or ja-x-f-5 - Most casual Japanese
3. Politeness Classifier (polite or p)
Identifies the politeness/respect level of language use.
Format:
- Long: language-x-polite-[1-5]
- Short: language-x-p-[1-5]
Politeness scale:
- 1 = Most respectful/deferential (royal address, religious contexts)
- 2 = Very polite (formal honorifics, respectful speech)
- 3 = Polite/neutral (standard politeness)
- 4 = Familiar (among equals, friends)
- 5 = Intimate/plain (family, very close friends)
Examples:
- ko-x-polite-1 or ko-x-p-1 - Highest respect Korean
- ja-x-polite-2 or ja-x-p-2 - Very polite Japanese
- th-x-polite-3 or th-x-p-3 - Standard polite Thai
4. Domain Classifier (domain or d)
Identifies specialized vocabulary or professional context.
Format:
- Long: language-x-domain-[domain_type]
- Short: language-x-d-[domain_type]
Examples:
- en-x-domain-legal or en-x-d-legal - Legal English
- ja-x-domain-med or ja-x-d-med - Medical Japanese
- ko-x-domain-business or ko-x-d-business - Business Korean
- ja-x-domain-tech or ja-x-d-tech - Technical Japanese
- en-x-domain-fin or en-x-d-fin - Financial English
5. Geographic Classifier (geo or g)
Identifies regional or geographic language varieties.
Format:
- Long: language-x-geo-[region]
- Short: language-x-g-[region]
Examples:
- ko-x-geo-gyeong or ko-x-g-gyeong - Gyeongsang Korean (경상도)
- ko-x-geo-jeolla or ko-x-g-jeolla - Jeolla Korean (전라도)
- es-x-geo-riopla or es-x-g-riopla - Rioplatense Spanish
- pt-x-geo-nordeste or pt-x-g-nordeste - Northeastern Brazilian Portuguese
6. Proto Classifier (proto or a)
Identifies proto-languages or reconstructed historical languages.
Format:
- Long: x-proto-[iso639-5_code if available]
- Short: x-a-[iso639-5_code if available]
Rules:
- MUST use ISO 639-5 language family codes when available
- Use descriptive identifiers only when no ISO 639-5 code exists
Examples using ISO 639-5 codes:
- x-proto-ine or x-a-ine - Proto-Indo-European
- x-proto-gem or x-a-gem - Proto-Germanic
- x-proto-sla or x-a-sla - Proto-Slavic
- x-proto-sem or x-a-sem - Proto-Semitic
- x-proto-cel or x-a-cel - Proto-Celtic
- x-proto-ira or x-a-ira - Proto-Iranian
- x-proto-inc or x-a-inc - Proto-Indo-Aryan
- x-proto-bat or x-a-bat - Proto-Baltic
- x-proto-roa or x-a-roa - Proto-Romance
- x-proto-trk or x-a-trk - Proto-Turkic
Examples without ISO 639-5 codes (descriptive, longer than three characters):
- x-proto-baltslav or x-a-baltslav - Proto-Balto-Slavic (no ISO 639-5 code)
Note:
- As primary BCP 47 language subtags, ISO 639-5 family codes identify whole language collections rather than a reconstructed proto-language, which is why proto-languages are implemented using x-proto
- They are valid and preferred within private-use extensions (after x-)
- Therefore all proto-language tags must start with x- to comply with BCP 47
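The proto rules can be illustrated with a small builder. This is a sketch under stated assumptions: `proto_tag` is a hypothetical helper, and `KNOWN_ISO_639_5` contains only the codes that appear in the examples above, not the full ISO 639-5 list, so a real implementation would need the complete code table.

```python
# Only the ISO 639-5 codes used in the examples above; NOT the full standard.
KNOWN_ISO_639_5 = {"ine", "gem", "sla", "sem", "cel", "ira", "inc", "bat", "roa", "trk"}


def proto_tag(identifier: str, short: bool = False) -> str:
    """Build a proto-language tag from a family code or a descriptive identifier."""
    ident = identifier.lower()
    if ident not in KNOWN_ISO_639_5:
        # Descriptive identifiers are longer than three characters and, like
        # every private-use subtag, at most eight ASCII alphanumerics.
        if not (3 < len(ident) <= 8 and ident.isascii() and ident.isalnum()):
            raise ValueError(f"not a usable proto identifier: {identifier!r}")
    classifier = "a" if short else "proto"
    return f"x-{classifier}-{ident}"


print(proto_tag("sla"))                   # x-proto-sla
print(proto_tag("baltslav", short=True))  # x-a-baltslav
```

Rejecting an identifier such as "slavic" when "sla" exists would require the complete code list, so this sketch does not attempt it.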
7. Historic Classifier (hist or h)
Identifies historical periods or stages of a language.
Format:
- Long: language-x-hist-[period]
- Short: language-x-h-[period]
Examples:
- en-x-hist-old or en-x-h-old - Old English period
- en-x-hist-middle or en-x-h-middle - Middle English period
- ja-x-hist-kobun or ja-x-h-kobun - Classical Japanese (古文)
- ko-x-hist-hunmin or ko-x-h-hunmin - Middle Korean (훈민정음 period)
- el-x-hist-koine or el-x-h-koine - Koine Greek (Κοινή)
- sa-x-hist-vedic or sa-x-h-vedic - Vedic Sanskrit (वैदिक)
8. Genre Classifier (genre or e)
Identifies text genre or literary style.
Format:
- Long: language-x-genre-[genre_type]
- Short: language-x-e-[genre_type]
Examples:
- en-x-genre-news or en-x-e-news - News English
- ja-x-genre-manga or ja-x-e-manga - Manga Japanese (漫画)
- ko-x-genre-webtoon or ko-x-e-webtoon - Korean webtoon (웹툰)
- zh-x-genre-shi or zh-x-e-shi - Chinese poetry (詩)
- fr-x-genre-bd or fr-x-e-bd - French comics (bande dessinée)
- de-x-genre-marchen or de-x-e-marchen - German fairy tales (Märchen)
9. Medium Classifier (medium or m)
Identifies the communication medium.
Format:
- Long: language-x-medium-[medium_type]
- Short: language-x-m-[medium_type]
Examples:
- en-x-medium-spoken or en-x-m-spoken - Spoken English
- ko-x-medium-digital or ko-x-m-digital - Digital/online Korean
- ja-x-medium-written or ja-x-m-written - Written Japanese
- hi-x-medium-bcast or hi-x-m-bcast - Broadcast Hindi
- zh-x-medium-sms or zh-x-m-sms - SMS/text message Chinese
10. Socio Classifier (socio or s)
Identifies sociolect or social group varieties.
Format:
- Long: language-x-socio-[social_group]
- Short: language-x-s-[social_group]
Examples:
- en-x-socio-academic or en-x-s-academic - Academic sociolect
- en-x-socio-urban or en-x-s-urban - Urban sociolect
- es-x-socio-juvenil or es-x-s-juvenil - Spanish youth sociolect (jerga juvenil)
- fr-x-socio-jeune or fr-x-s-jeune - French youth sociolect
- de-x-socio-jugend or de-x-s-jugend - German youth sociolect (Jugendsprache)
- ko-x-socio-online or ko-x-s-online - Korean online sociolect
11. Modality Classifier (modality or o)
Identifies the fundamental mode of language production.
Format:
- Long: language-x-modality-[mode]
- Short: language-x-o-[mode]
Examples:
- en-x-modality-spoken or en-x-o-spoken - Spoken English
- en-x-modality-written or en-x-o-written - Written English
- ase-x-modality-signed or ase-x-o-signed - American Sign Language
- en-x-modality-multi or en-x-o-multi - Multimodal English (speech + gestures)
- fr-x-modality-tactile or fr-x-o-tactile - Tactile French (for deafblind)
12. Register Classifier (register or r)
Identifies the linguistic register or functional variety of language use.
Format:
- Long: language-x-register-[register_type]
- Short: language-x-r-[register_type]
Examples:
- en-x-register-frozen or en-x-r-frozen - Frozen register (prayers, pledges)
- en-x-register-formal or en-x-r-formal - Formal register (academic papers)
- en-x-register-consult or en-x-r-consult - Consultative register (professional)
- en-x-register-casual or en-x-r-casual - Casual register (friends)
- en-x-register-intimate or en-x-r-intimate - Intimate register (family)
13. Pragmatic Function Classifier (pragma or u)
Identifies the communicative function or speech act.
Format:
- Long: language-x-pragma-[function]
- Short: language-x-u-[function]
Examples:
- en-x-pragma-request or en-x-u-request - Request function
- ja-x-pragma-apology or ja-x-u-apology - Apology function
- es-x-pragma-complmnt or es-x-u-complmnt - Compliment function
- ar-x-pragma-greeting or ar-x-u-greeting - Greeting function
- zh-x-pragma-refusal or zh-x-u-refusal - Refusal function
14. Temporal Marking Classifier (temporal or t)
Identifies temporal aspects or tense usage patterns.
Format:
- Long: language-x-temporal-[aspect]
- Short: language-x-t-[aspect]
Examples:
- en-x-temporal-past or en-x-t-past - Past-oriented discourse
- ja-x-temporal-nonpast or ja-x-t-nonpast - Non-past focus
- id-x-temporal-atemprl or id-x-t-atemprl - Timeless/atemporal
- fr-x-temporal-future or fr-x-t-future - Future-oriented
- zh-x-temporal-aspect or zh-x-t-aspect - Aspectual focus
15. Evidentiality Classifier (evidence or v)
Identifies information source marking.
Format:
- Long: language-x-evidence-[source]
- Short: language-x-v-[source]
Examples:
- qu-x-evidence-direct or qu-x-v-direct - Direct witness
- tr-x-evidence-hearsay or tr-x-v-hearsay - Hearsay/reported
- ja-x-evidence-infer or ja-x-v-infer - Inferential
- en-x-evidence-assume or en-x-v-assume - Assumed
- de-x-evidence-quote or de-x-v-quote - Quotative
16. Affect/Emotion Classifier (affect or k)
Identifies emotional tone or affect.
Format:
- Long: language-x-affect-[emotion]
- Short: language-x-k-[emotion]
Examples:
- en-x-affect-angry or en-x-k-angry - Angry tone
- ja-x-affect-humble or ja-x-k-humble - Humble affect
- es-x-affect-joyful or es-x-k-joyful - Joyful expression
- ko-x-affect-sad or ko-x-k-sad - Sad/melancholic
- fr-x-affect-neutral or fr-x-k-neutral - Neutral affect
17. Age/Generation Classifier (age or n)
Identifies age-related or generational language varieties.
Format:
- Long: language-x-age-[generation]
- Short: language-x-n-[generation]
Examples:
- en-x-age-child or en-x-n-child - Child speech
- ja-x-age-teen or ja-x-n-teen - Teenager language
- ko-x-age-elder or ko-x-n-elder - Elder speech
- es-x-age-genz or es-x-n-genz - Generation Z
- zh-x-age-millenl or zh-x-n-millenl - Millennial speech
18. Gender Classifier (gender or i)
Identifies gender-related language varieties.
Format:
- Long: language-x-gender-[identity]
- Short: language-x-i-[identity]
19. Expertise Level Classifier (expert or b)
Identifies level of domain expertise on a 0-10 scale.
Format:
- Long: language-x-expert-[0-10]
- Short: language-x-b-[0-10]
Expertise scale:
- 0 = No knowledge
- 1-2 = Beginner
- 3-4 = Intermediate
- 5-6 = Advanced
- 7-8 = Expert
- 9-10 = Master/Authority
Examples:
- en-x-expert-0 or en-x-b-0 - No expertise
- de-x-expert-3 or de-x-b-3 - Intermediate level
- ja-x-expert-7 or ja-x-b-7 - Expert level
- es-x-expert-9 or es-x-b-9 - Master level
- zh-x-expert-5 or zh-x-b-5 - Advanced level
20. Interactional Structure Classifier (interact or 2)
Identifies conversational or interactional patterns.
Format:
- Long: language-x-interact-[structure]
- Short: language-x-2-[structure]
Examples:
- en-x-interact-turn or en-x-2-turn - Turn-taking
- ja-x-interact-overlap or ja-x-2-overlap - Overlapping speech
- es-x-interact-monolog or es-x-2-monolog - Monologic
- ar-x-interact-dialog or ar-x-2-dialog - Dialogic
- zh-x-interact-multi or zh-x-2-multi - Multi-party
21. Prosodic Features Classifier (prosody or y)
Identifies prosodic or suprasegmental features.
Format:
- Long: language-x-prosody-[feature]
- Short: language-x-y-[feature]
Examples:
- en-x-prosody-stress or en-x-y-stress - Stress-timed
- ja-x-prosody-pitch or ja-x-y-pitch - Pitch-accent
- fr-x-prosody-syllable or fr-x-y-syllable - Syllable-timed
- zh-x-prosody-tone or zh-x-y-tone - Tonal patterns
- es-x-prosody-rhythm or es-x-y-rhythm - Rhythmic patterns
22. Lexical Density Classifier (lexical or l)
Identifies lexical density as a numeric value (0-100).
Format:
- Long: language-x-lexical-[0-100]
- Short: language-x-l-[0-100]
Examples (labels follow the Lexical Density Scale under Valid Values):
- en-x-lexical-20 or en-x-l-20 - Very low density (20%)
- de-x-lexical-55 or de-x-l-55 - Moderate density (55%)
- ja-x-lexical-75 or ja-x-l-75 - High density (75%)
- es-x-lexical-40 or es-x-l-40 - Low density (40%)
- zh-x-lexical-85 or zh-x-l-85 - Very high density (85%)
23. Syntactic Complexity Classifier (syntax or z)
Identifies syntactic complexity as a numeric value (0-100).
Format:
- Long: language-x-syntax-[0-100]
- Short: language-x-z-[0-100]
Examples (labels follow the Syntactic Complexity Scale under Valid Values):
- en-x-syntax-15 or en-x-z-15 - Very simple syntax (15%)
- de-x-syntax-70 or de-x-z-70 - Complex syntax (70%)
- ja-x-syntax-45 or ja-x-z-45 - Moderate complexity (45%)
- es-x-syntax-30 or es-x-z-30 - Simple syntax (30%)
- zh-x-syntax-60 or zh-x-z-60 - Moderate complexity (60%)
24. Start Date Classifier (start or 0)
Identifies the start date of language use (ISO 8601 format without punctuation).
Format:
- Long: language-x-start-[YYYYMMDD]
- Short: language-x-0-[YYYYMMDD]
Date formats:
- Full date: YYYYMMDD
- Year-month: YYYYMM
- Year only: YYYY
Examples:
- en-x-start-20240315 or en-x-0-20240315 - English starting March 15, 2024
- ja-x-start-19890108 or ja-x-0-19890108 - Japanese starting January 8, 1989
- es-x-start-202403 or es-x-0-202403 - Spanish starting March 2024
25. End Date Classifier (end or 1)
Identifies the end date of language use (ISO 8601 format without punctuation).
Format:
- Long: language-x-end-[YYYYMMDD]
- Short: language-x-1-[YYYYMMDD]
Date formats:
- Full date: YYYYMMDD
- Year-month: YYYYMM
- Year only: YYYY
Examples:
- en-x-end-20240415 or en-x-1-20240415 - English ending April 15, 2024
- ja-x-end-20190430 or ja-x-1-20190430 - Japanese ending April 30, 2019
- es-x-end-202412 or es-x-1-202412 - Spanish ending December 2024
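The three date shapes used by the start and end classifiers could be decoded roughly as follows. The helper name `parse_lvtag_date` is hypothetical, and returning `None` for a missing month or day is a design choice of this sketch rather than something the specification prescribes.

```python
from datetime import date
from typing import Optional


def parse_lvtag_date(value: str) -> tuple[int, Optional[int], Optional[int]]:
    """Decode YYYY, YYYYMM, or YYYYMMDD into (year, month, day)."""
    if not (value.isascii() and value.isdigit() and len(value) in (4, 6, 8)):
        raise ValueError(f"expected YYYY, YYYYMM, or YYYYMMDD, got {value!r}")
    year = int(value[:4])
    month = int(value[4:6]) if len(value) >= 6 else None
    day = int(value[6:8]) if len(value) == 8 else None
    # date() raises ValueError for impossible dates such as month 13 or Feb 30;
    # missing parts are checked as if they were 1.
    date(year, 1 if month is None else month, 1 if day is None else day)
    return year, month, day


print(parse_lvtag_date("19890108"))  # (1989, 1, 8)
print(parse_lvtag_date("202403"))    # (2024, 3, None)
```

Keeping the precision explicit lets a caller distinguish "March 2024" from "March 1, 2024", which matters when comparing start and end ranges.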
26. Taboo Classifier (taboo or j)
Identifies level of taboo, vulgar, or offensive content.
Format:
- Long: language-x-taboo-[0-5]
- Short: language-x-j-[0-5]
Examples:
- en-x-taboo-0 or en-x-j-0 - No taboo content
- en-x-taboo-3 or en-x-j-3 - Moderate taboo level
- ja-x-form-5-taboo-4 or ja-x-f-5-j-4 - Very casual Japanese with high taboo level
27. Confidence Classifier (conf or c)
Indicates confidence score for the immediately preceding classifier.
Format:
- Long: language-x-[classifier]-[value]-conf-[0-100]
- Short: language-x-[classifier]-[value]-c-[0-100]
Special behavior:
- The confidence score applies to the classifier immediately before it
- Multiple confidence scores can be used for different classifiers
- If no classifier precedes it, the confidence applies to the base language tag
Examples:
- en-x-form-3-conf-95 or en-x-f-3-c-95 - Neutral formality with 95% confidence
- ko-x-polite-2-conf-80-domain-med-conf-60 or ko-x-p-2-c-80-d-med-c-60 - Very polite (80% confidence) medical Korean (60% confidence)
- ja-x-hist-kobun-conf-100 or ja-x-h-kobun-c-100 - Classical Japanese with 100% confidence
- x-proto-ine-conf-75 or x-a-ine-c-75 - Proto-Indo-European with 75% confidence
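Because conf is position-sensitive, a parser has to remember which classifier it has just read. The sketch below shows one way to do that; `parse_with_confidence` and its return shape are hypothetical, and the tag is assumed to contain an `x` section with well-formed pairs.

```python
from typing import Optional


def parse_with_confidence(tag: str):
    """Return (base, base_confidence, [(classifier, value, confidence), ...])."""
    subtags = tag.lower().split("-")
    x_index = subtags.index("x")          # raises ValueError if there is no x
    base = "-".join(subtags[:x_index])
    rest = subtags[x_index + 1:]
    entries: list[list] = []              # [classifier, value, confidence]
    base_confidence: Optional[int] = None
    for name, value in zip(rest[0::2], rest[1::2]):
        if name in ("conf", "c"):
            score = int(value)
            if not 0 <= score <= 100:
                raise ValueError(f"confidence out of range: {score}")
            if entries:
                entries[-1][2] = score    # applies to the preceding classifier
            else:
                base_confidence = score   # nothing precedes it: base language
        else:
            entries.append([name, value, None])
    return base, base_confidence, [tuple(entry) for entry in entries]


print(parse_with_confidence("ko-x-polite-2-conf-80-domain-med-conf-60"))
# ('ko', None, [('polite', '2', 80), ('domain', 'med', 60)])
```

A classifier with no following conf keeps `None`, which keeps "no confidence given" distinct from a score of 0.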
Multiple Classifications
LVTag supports multiple classifiers in a single tag to provide precise language identification. Both long and short forms can be mixed:
ko-x-form-4-domain-business
ko-x-f-4-d-business
ko-x-form-4-polite-2-domain-business
ko-x-f-4-p-2-d-business
The last two tags above describe Korean with informal formality (4) but very polite speech (2) in a business context; the first two omit the politeness classifier.
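Going the other way, a tag with several classifiers can be assembled from a mapping. This is a minimal sketch: `build_lvtag` is a hypothetical helper, and it only checks the general subtag shape (1 to 8 ASCII alphanumerics), not whether a classifier name is one of the defined magic tags.

```python
def build_lvtag(language: str, classifiers: dict[str, object]) -> str:
    """Join a base language tag and classifier/value pairs into one tag."""
    if not classifiers:
        return language                   # nothing to add, plain BCP 47 tag
    parts = ([language] if language else []) + ["x"]
    for name, value in classifiers.items():
        text = str(value).lower()
        for subtag in (name, text):
            if not (1 <= len(subtag) <= 8 and subtag.isascii() and subtag.isalnum()):
                raise ValueError(f"invalid subtag {subtag!r}")
        parts += [name, text]
    return "-".join(parts)


print(build_lvtag("ko", {"form": 4, "polite": 2, "domain": "business"}))
# ko-x-form-4-polite-2-domain-business
```

Using a dict means the same key cannot appear twice, although it would not stop a caller from passing both form and f; catching that needs the long/short table.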
Valid Values
Note: All values must be 8 characters or shorter to comply with BCP 47 subtag length restrictions. While specific values for many classifiers are to be established through expert usage and community consensus, the numeric scales, date formats, and basic values listed below are defined in this standard.
Formality Scale (Universal)
Level | Description | Examples |
---|---|---|
1 | Most formal | Legal documents, official ceremonies, academic papers |
2 | Formal | Business letters, news articles, presentations |
3 | Neutral | Standard conversation, email, general writing |
4 | Informal | Casual conversation, personal blogs, text messages |
5 | Most casual | Slang, intimate conversation, social media |
Politeness Scale (Universal)
Level | Description | Examples |
---|---|---|
1 | Most respectful | Royal address, religious leaders, elderly respect |
2 | Very polite | Customer service, formal meetings, teachers |
3 | Polite/neutral | Standard interactions, colleagues |
4 | Familiar | Friends, peers, casual acquaintances |
5 | Intimate/plain | Close family, intimate partners |
Expertise Scale (Universal)
Level | Description |
---|---|
0 | No knowledge |
1-2 | Beginner |
3-4 | Intermediate |
5-6 | Advanced |
7-8 | Expert |
9-10 | Master/Authority |
Taboo Scale (Universal)
Level | Description |
---|---|
0 | No taboo content |
1 | Mild taboo |
2 | Light taboo |
3 | Moderate taboo |
4 | High taboo |
5 | Extreme taboo |
Lexical Density Scale (Universal)
Level | Description |
---|---|
0-20 | Very low density |
21-40 | Low density |
41-60 | Moderate density |
61-80 | High density |
81-100 | Very high density |
Syntactic Complexity Scale (Universal)
Level | Description |
---|---|
0-20 | Very simple |
21-40 | Simple |
41-60 | Moderate complexity |
61-80 | Complex |
81-100 | Very complex |
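The two 0-100 scales above share the same band boundaries, so a single lookup can label either score. The names `LEXICAL_BANDS`, `SYNTAX_BANDS`, and `band_label` are hypothetical and exist only for this sketch.

```python
# (inclusive upper bound, label) pairs taken from the two scale tables above.
LEXICAL_BANDS = [
    (20, "Very low density"), (40, "Low density"), (60, "Moderate density"),
    (80, "High density"), (100, "Very high density"),
]
SYNTAX_BANDS = [
    (20, "Very simple"), (40, "Simple"), (60, "Moderate complexity"),
    (80, "Complex"), (100, "Very complex"),
]


def band_label(score: int, bands: list[tuple[int, str]]) -> str:
    """Return the label of the first band whose upper bound covers the score."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be between 0 and 100, got {score}")
    for upper, label in bands:
        if score <= upper:
            return label
    raise AssertionError("unreachable: the last band ends at 100")


print(band_label(55, LEXICAL_BANDS))  # Moderate density
print(band_label(70, SYNTAX_BANDS))   # Complex
```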
Domain Values
Value | Description |
---|---|
legal | Legal terminology |
med | Medical terminology |
tech | Technical/IT |
business | Business/corporate |
fin | Finance/banking |
acad | Academic/scholarly |
sci | Scientific/research |
Implementation Examples
Single Classifier (Long Form)
# Most formal Korean
ko-x-form-1
# Very polite Japanese
ja-x-polite-2
# Legal English
en-x-domain-legal
# Gyeongsang Korean
ko-x-geo-gyeong
# Proto-Indo-European
x-proto-ine
Single Classifier (Short Form)
# Most formal Korean
ko-x-f-1
# Very polite Japanese
ja-x-p-2
# Legal English
en-x-d-legal
# Gyeongsang Korean
ko-x-g-gyeong
# Proto-Indo-European
x-a-ine
Multiple Classifiers
# Informal but polite Korean business language
ko-x-form-4-polite-2-domain-business
ko-x-f-4-p-2-d-business
# Formal and respectful Japanese medical language
ja-x-form-1-polite-1-domain-med
ja-x-f-1-p-1-d-med
# Southern Vietnamese with neutral formality, polite speech, technical domain
vi-x-geo-southern-form-3-polite-2-domain-tech
vi-x-g-southern-f-3-p-2-d-tech
# Complex classification with multiple dimensions
en-x-h-middle-e-poetry-m-written-f-1
ja-x-f-2-p-1-d-med-h-kobun-m-written
# Language varieties showing formality/politeness distinction
ko-x-f-5-p-2 # Very casual but polite (to older friend)
ko-x-f-1-p-4 # Very formal but familiar (written to peer)
ja-x-f-4-p-1 # Casual formality but highest respect
en-x-f-5-j-4 # Very casual English with high taboo level
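To make the formality/politeness distinction in the last group concrete, the following sketch turns the two numeric classifiers into the labels from the universal scales. `describe` is a hypothetical helper that ignores every other classifier and assumes well-formed pairs after `x`.

```python
FORMALITY = {1: "Most formal", 2: "Formal", 3: "Neutral", 4: "Informal", 5: "Most casual"}
POLITENESS = {1: "Most respectful", 2: "Very polite", 3: "Polite/neutral",
              4: "Familiar", 5: "Intimate/plain"}


def describe(tag: str) -> list[str]:
    """List the formality and politeness labels encoded in a tag."""
    subtags = tag.lower().split("-")
    rest = subtags[subtags.index("x") + 1:]
    notes = []
    for name, value in zip(rest[0::2], rest[1::2]):
        if name in ("form", "f"):
            notes.append("formality: " + FORMALITY.get(int(value), "out of range"))
        elif name in ("polite", "p"):
            notes.append("politeness: " + POLITENESS.get(int(value), "out of range"))
    return notes


print(describe("ko-x-f-5-p-2"))  # ['formality: Most casual', 'politeness: Very polite']
print(describe("ko-x-f-1-p-4"))  # ['formality: Most formal', 'politeness: Familiar']
```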
Use Cases
- Language Learning Applications
  - Teach appropriate register for different social contexts
  - Provide domain-specific vocabulary training
- Machine Translation
  - Maintain register consistency in translations
  - Apply domain-specific terminology
- Content Classification
  - Automatically categorize text by formality and domain
  - Route content to appropriate reviewers or systems
- Corpus Linguistics
  - Build tagged corpora for linguistic research
  - Study register and domain variation
Validation Rules
- Subtag Length: Each subtag after x- must be 8 characters or fewer
- Order: Classifiers can appear in any order after x-
- Uniqueness: Each classifier type should appear only once per tag (except conf, which can appear multiple times)
- Case: Tags should be lowercase (case-insensitive per BCP 47)
- Magic Tags: Short form tags are single characters; q and 3-9 are reserved for future use
- Mixing: Long and short forms can be mixed within the same tag
- Proto Tags: Must start with x- and SHOULD use ISO 639-5 codes when available (e.g., x-proto-sla, not x-proto-slavic)
- Confidence: The conf/c classifier applies to the immediately preceding classifier
- Numeric Values: Must be within defined ranges (1-5 for formality and politeness, 0-5 for taboo, 0-10 for expertise, 0-100 for lexical density, syntactic complexity, and confidence)
- Date Format: Dates use ISO 8601 without punctuation (YYYY, YYYYMM, or YYYYMMDD)
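Several of these rules are mechanical enough to check in code. The validator below is a sketch, not the official JSON Schema: `validate` and its error strings are hypothetical, it checks only the rules listed above, and it does not verify base language subtags or ISO 639-5 codes.

```python
import re

LONG = ("ortho form polite domain geo proto hist genre medium socio modality "
        "register pragma temporal evidence affect age gender expert interact "
        "prosody lexical syntax start end taboo conf").split()
SHORT = list("wfpdgahemsorutvknib2ylz01jc")   # same order as LONG
CANONICAL = {**{name: name for name in LONG}, **dict(zip(SHORT, LONG))}
RANGES = {"form": (1, 5), "polite": (1, 5), "taboo": (0, 5), "expert": (0, 10),
          "lexical": (0, 100), "syntax": (0, 100), "conf": (0, 100)}
DATE = re.compile(r"\d{4}(\d{2}){0,2}")       # YYYY, YYYYMM, or YYYYMMDD


def validate(tag: str) -> list[str]:
    """Return a list of rule violations; an empty list means none were found."""
    subtags = tag.lower().split("-")          # case-insensitive per BCP 47
    if "x" not in subtags:
        return ["missing private-use section"]
    x_index = subtags.index("x")
    base, rest = subtags[:x_index], subtags[x_index + 1:]
    errors = []
    if not rest or len(rest) % 2:
        errors.append("classifiers and values must come in pairs")
    seen = set()
    for name, value in zip(rest[0::2], rest[1::2]):
        if not all(1 <= len(s) <= 8 for s in (name, value)):
            errors.append(f"subtag length out of range near {name!r}")
        kind = CANONICAL.get(name)            # q and 3-9 are reserved: unknown
        if kind is None:
            errors.append(f"unknown classifier {name!r}")
            continue
        if kind != "conf":                    # conf may repeat
            if kind in seen:
                errors.append(f"classifier {kind!r} appears more than once")
            seen.add(kind)
        if kind == "proto" and base:
            errors.append("proto tags must start with x-")
        if kind in RANGES:
            low, high = RANGES[kind]
            if not (value.isascii() and value.isdigit() and low <= int(value) <= high):
                errors.append(f"{kind} value {value!r} not in {low}-{high}")
        if kind in ("start", "end") and not DATE.fullmatch(value):
            errors.append(f"{kind} value {value!r} is not YYYY, YYYYMM, or YYYYMMDD")
    return errors


print(validate("ko-x-f-4-p-2-d-business"))  # []
```

Mapping short forms to their long names before the uniqueness check means a tag mixing f and form is reported as a duplicate.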
Compatibility
LVTag format is fully compatible with:
- BCP 47 (RFC 5646)
- ISO 639 language codes
- IANA Language Subtag Registry
- Unicode CLDR
Benefits
- Precision: Enables fine-grained language variety identification
- Extensibility: New registers and domains can be added
- Standards-based: Built on established BCP 47 private-use mechanism
- Machine-readable: Systematic format enables automated processing
- Human-readable: Clear, descriptive subtags
- Flexibility: Support for both verbose long-form and concise short-form tags
- Brevity: Short magic tags enable compact representation while maintaining clarity
Future Extensions
LVTag is designed to evolve with the needs of the language technology community. We welcome suggestions for new classifiers, improvements to existing ones, and real-world implementation feedback.
To propose extensions or contribute to the specification:
- Open an issue at github.com/lvtag/spec
- Join the discussion on existing proposals
- Share your implementation experiences
- Submit pull requests for documentation improvements
Reserved single-character codes (q, 3-9) are available for future standardized extensions.
License and Patent Grant
This specification is released under the CC0 1.0 Universal (Public Domain Dedication).
Why CC0: To ensure maximum adoption and implementation freedom, LVTag is placed in the public domain. This means:
- No permission needed to use, implement, or modify
- No attribution required (though appreciated)
- No legal barriers for commercial or governmental use
- Compatible with all software licenses
- Used by major standards like Unicode CLDR
Patent Grant: Any patents covering the LVTag specification are hereby licensed royalty-free for any implementation that complies with this specification.
No Endorsement: Use of LVTag does not imply endorsement by the specification authors.
To the extent possible under law, Danslav Slavenskoj has waived all copyright and related or neighboring rights to the Language Variant Tag (LVTag) Format Specification. This work is published from: United States of America.