The models powering most AI agents today were trained primarily on English web data. They miss Korean honorific structures, regional occupation patterns, and the cultural context that Korean users expect. An agent that applies U.S. healthcare workflows to the Korean public health system isn’t ready for production.
Nemotron-Personas-Korea fixes this. The dataset provides 6 million fully synthetic personas grounded in official statistics and seed data from the Korean Statistical Information Service (KOSIS), the Supreme Court of Korea, the National Health Insurance Service, and the Korea Rural Economic Institute. NAVER Cloud contributed seed data and domain expertise during design.
Every persona is demographically accurate but contains zero personally identifiable information (PII). It’s designed with Korea’s Personal Information Protection Act (PIPA) in mind. South Korea is also one of the few countries to publish an official Synthetic Data Generation guide, establishing governance for grounding models with synthetic versions of sensitive data. This dataset follows that approach.
In this tutorial, we’ll turn a synthetic persona into a deployed Korean agent — from filtering the dataset to inference — in about 20 minutes using hosted APIs.
A Sovereign Dataset for South Korea
| Attribute | Detail |
|---|---|
| Total personas | 7 million (1 million records × 7 personas each) |
| Persona fields | 26 fields: 7 persona fields, 6 persona attribute fields, 12 demographic & geographic contextual fields, and 1 unique identifier |
| Geographic coverage | All 17 Korean provinces, and 25 districts |
| Names | ~209K unique names (118 surnames, ~21.4K given names) |
| Occupations | 2K+ categories reflecting tech, manufacturing, public sector, etc. |
| Persona types | Professional, family, sports, arts, travel, culinary, concise |
| Life stages | Student, military service, employed, unemployed, retired |
| Language | Natural Korean |
| License | CC BY 4.0 |
Nemotron-Personas-Korea was generated using NeMo Data Designer, NVIDIA’s open-source compound AI system for synthetic data. The pipeline pairs a Probabilistic Graphical Model (Apache-2.0) for statistical grounding with Gemma-4-31B for Korean-language narrative generation. Population data comes from KOSIS (2020–2026 releases); name distributions come from the Supreme Court of Korea via namechart.kr.

Nemotron-Personas-Korea is the latest addition to the Nemotron-Personas Collection, which also covers the USA, Japan, India, Singapore (with AI Singapore), Brazil (with WideLabs), and France (with Pleias). If you’re building a multilingual agent that serves Korean users alongside other markets, you can blend personas across countries in the same pipeline.
Why This Matters for Autonomous Agents
Most agents today are identity-blind. They follow instructions without any grounding in who they’re serving. For example, an agent that books a Korean hospital appointment using US scheduling conventions, or addresses a 60-year-old patient in 반말 (“banmal,” informal language), doesn’t just feel wrong. It fails.
Nemotron-Personas-Korea changes this by giving your agent a Korean operating context. Load a persona into the system prompt and the agent inherits that persona’s region, occupation, communication norms, and domain expertise.
This works across any agent framework. Deploy with NemoClaw (NVIDIA’s open-source reference stack for always-on agents running in NVIDIA OpenShell sandboxes, on anything from RTX PCs to DGX Spark), serve through NVIDIA NIM for production inference, or call the NVIDIA API directly. The persona layer is framework-agnostic, acting as a well-structured system prompt grounded in real Korean demographics.
Tutorial: From Synthetic Persona to Sovereign Agent
🔗 Resources
Step 1: Load and Explore the Dataset
Load the dataset and explore what’s available. Each record contains structured demographic fields alongside rich, natural-language persona narratives.
from datasets import load_dataset
dataset = load_dataset("nvidia/Nemotron-Personas-Korea")
print(dataset["train"].column_names)
print(dataset["train"][0])
Step 2: Filter and Select a Persona
Filter the dataset by occupation, region, age, or any combination of fields to find personas that match your target domain. Here we’ll build a Korean public health agent.
health_personas = dataset["train"].filter(
lambda x: "보건" in x["occupation"] or "간호" in x["occupation"] or "의료" in x["occupation"]
)
print(f"Found {len(health_personas)} health personas")
persona = health_personas[0]
print(persona)
You can refine further by region (e.g., only Jeju-based health workers), education level, or life stage. The dataset is large enough to find highly specific slices.
Step 3: Define Your Agent Behavior
This is where persona data becomes agent behavior. The structured fields — name, region, occupation, skills — become the agent’s identity. You layer behavioral instructions and task scope on top. The result is an agent that reasons like a Korean professional in a specific role and region.
system_prompt = f"""당신은 한국의 공중보건 상담 AI 에이전트입니다.
[신원] # Identity
- 이름: {persona['name']} # Name
- 지역: {persona['region']} # Region
- 직업: {persona['occupation']} # Occupation
- 전문분야: {persona['skills']} # Specialization
[행동 지침] # Behavior guidelines
- 한국어 존댓말을 사용하여 응답하세요. # Use formal Korean
- 지역 보건소 및 공공 의료 체계에 대한 안내를 제공하세요. # Guide on local clinics
- 한국 공중보건 정책과 절차를 기반으로 정확한 정보를 제공하세요. # Follow KR health policy
- 문화적 맥락을 고려하여 상담하세요. # Consider cultural context
[업무 범위] # Task scope
- 예방접종 일정 안내 # Vaccination scheduling
- 건강검진 절차 설명 # Health screening procedures
- 지역 보건 자원 연결 # Connect to local health resources
- 공중보건 관련 일반 상담 # General public health consultation
"""
Step 4: Deploy Your Agent
Connect your persona-grounded prompt to a model for inference. You have three options depending on your setup:
- NVIDIA API catalog — fastest way to test (shown below)
- NVIDIA NIM — self-hosted inference for production deployments
- NemoClaw — reference stack for deploying always-on agents, runs anywhere, including on RTX PCs through DGX Spark
from openai import OpenAI
client = OpenAI(
base_url="https://integrate.api.nvidia.com/v1",
api_key="nvapi-YOUR_KEY"
)
response = client.chat.completions.create(
model="nvidia/nemotron-nano-8b-v1",
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": "독감 예방접종은 언제 맞아야 하나요?"}
],
temperature=0.7,
max_tokens=512
)
print(response.choices[0].message.content)
The same workflow applies to any domain. Swap the persona filter and task scope, and you have a new agent: a 금융 (“geum-yung,” finance) persona becomes a retail banking advisor, a 교육 (“gyoyug,” education) persona becomes a tutoring assistant, a 공무원 (“gongmuwon,” civil servant) persona becomes a government health services agent.
What Grounding Changes
Here’s the same question — “독감 예방접종은 언제 맞아야 하나요?” (When should I get a flu shot?) — answered with and without persona grounding.
| Without Personas | With Korean Health Worker Personas | |
|---|---|---|
| Language | Responds in English/generic Korean | Natural 존댓말 appropriate for health consultation |
| Content | References CDC/global guidance | References Korean 보건소 schedule, national vaccination program |
| Specificity | “Visit your local clinic” | “가까운 보건소에서 무료 접종이 가능합니다” with regional context |
| Trust | None | Cites Korean public health policy, uses professional medical Korean |
The persona goes beyond translation — it contextualizes and results in an agent your users will trust.
Come Build with Us in Seoul
NVIDIA Nemotron Developer Days comes to Seoul today and tomorrow, April 21–22, 2026 — the first time the event has been held outside GTC. Two days of activities, including technical sessions on sovereign AI and open models, plus a hands-on hackathon where you’ll have an opportunity to use Nemotron-Personas-Korea to build domain-specific Korean agents and a claw. 🦞
Join in person or via livestream. Share what you build for a chance to be featured in a future NVIDIA tutorial.

