Interview Preparation Guide · Updated June 2026

AI Engineering Interview Questions 2026RAG, LLMs, Agents and Projects

AI engineering interviews test your ability to build production AI applications — not just theory. This guide covers every topic area interviewers test: LLMs, prompt engineering, RAG, vector databases, AI agents, LangChain, MCP, deployment, system design, and how to walk through your projects.

Designed for software developers, data scientists, data engineers, and freshers with technical backgrounds who are preparing for AI engineering roles in India and globally.

View AI Engineer Skills Map →Build Portfolio Projects Join AI Engineering Course

AI Engineering Interview: Area-by-Area Breakdown

Interview Area	What Interviewers Check	Recommended Resource
Python / API basics	Can you call an LLM API, parse JSON, build a FastAPI endpoint, and handle errors correctly?	/ai-engineer-skills
LLMs	Do you understand tokens, context windows, temperature, structured outputs, and tool calling?	/what-is-rag
Prompt engineering	Can you design effective system prompts, handle prompt injection risks, and evaluate prompt quality?	/course/advanced-prompt-engineering-training
RAG	Can you design and explain the full RAG pipeline — chunking, embedding, vector search, retrieval, generation, evaluation?	/what-is-rag
Vector databases	Do you know what embeddings are, how similarity search works, and which tools to use when?	/what-is-a-vector-database
AI agents	Can you explain tool calling, memory, planner/executor patterns, LangGraph, and multi-agent coordination?	/what-are-ai-agents
MCP	Do you understand what MCP is, how MCP servers and clients work, and how they differ from direct API calls?	/what-is-mcp
Deployment	Can you deploy a FastAPI app with Docker, add logging, handle latency, and monitor cost?	/course/production-ai-engineering
Projects	Can you explain a real project end-to-end — architecture, decisions, evaluation, results, and what you would improve?	/ai-engineer-projects
System design	Can you design a production RAG chatbot or AI assistant from scratch — covering retrieval, evaluation, security, and monitoring?	/ai-engineering

How AI Engineering Interviews Are Different

AI engineering is a newer discipline and its interviews reflect that. Unlike traditional ML engineering interviews — which heavily test model architecture, maths, and research papers — AI engineering interviews focus on applied building skills. See the AI Engineering Guide for the full definition of the discipline.

Deployed projects matter more than degrees

Showing a working RAG assistant with a GitHub README explaining your architecture, chunking strategy, evaluation scores, and deployment — even as a side project — signals AI engineering readiness far more than ML coursework alone.

Evaluation is a first-class topic

Knowing how to evaluate LLM outputs (RAGAS, human review, LLM-as-judge) is a differentiator. Many candidates can build an initial RAG pipeline but cannot explain how they measured whether it was working correctly.

Cost and latency awareness is expected

Production AI engineers manage LLM API costs and response latency. Being able to discuss trade-offs between model size, context length, caching, and cost shows engineering maturity beyond tutorials.

Not just Python scripts — real APIs and deployments

The gap between "I made a LangChain notebook" and "I deployed a FastAPI endpoint with Docker, environment variables, logging, and monitoring" is what interviewers assess. The latter signals production-readiness.

System design over algorithmic puzzles

AI engineering roles often include system design rounds focused on AI architectures — designing a document Q&A system, designing a multi-agent workflow, or designing a production RAG pipeline. LeetCode-style algorithmic questions are rare; design questions are common.

Beginner-Level AI Engineering Interview Questions

Foundation concepts every AI engineering candidate must know.

What is AI engineering?+

AI engineering is the discipline of building, deploying, and maintaining software applications that use AI models — especially large language models (LLMs) — to solve real problems. It combines software engineering (APIs, deployment, logging) with applied AI skills (prompt engineering, RAG, agents). The role is distinct from ML engineering, which focuses on training models, and from data science, which focuses on analysis and insights.

What is a large language model (LLM)?+

An LLM is a deep learning model trained on large text datasets to understand and generate human language. It predicts the next token (word fragment) given a context. LLMs like GPT-4, Claude, and Gemini are used via APIs — you send a prompt and receive a generated response. AI engineers do not typically train LLMs; they build applications on top of them via APIs.

What is prompt engineering?+

Prompt engineering is the practice of designing inputs (prompts) to LLMs to reliably get the outputs you need. It involves designing system prompts, structuring context, using few-shot examples, and controlling output format. In production AI systems, prompt design directly affects output quality, reliability, and cost.

What is a token?+

A token is the unit of text an LLM processes. Tokens are not exactly words — a common English word is roughly one token; longer words, code, and non-English text may be two or more tokens. LLMs have context window limits (e.g., 128K tokens) and APIs charge per token. Engineers must track and manage token usage for cost and context window reasons.

What is an API?+

An API (Application Programming Interface) is a way for software to communicate with another service over the internet using structured requests and responses. LLM providers expose APIs — you send a JSON payload with your prompt and model parameters, and receive a JSON response with the generated text. AI engineers spend a large proportion of their time working with LLM APIs and building APIs of their own (using FastAPI or similar).

What is JSON and why is it important in AI engineering?+

JSON (JavaScript Object Notation) is a lightweight text format for structured data. It is the standard format for API requests and responses, including LLM API calls. AI engineers need to read, parse, and write JSON constantly — from sending prompts to LLM APIs, to parsing structured outputs, to returning results from FastAPI endpoints. Structured outputs from LLMs (getting a model to return valid JSON) are a common production requirement.

What is an embedding?+

An embedding is a numerical representation of text (or other data) as a high-dimensional vector. An embedding model converts a sentence or paragraph into a list of numbers such that semantically similar texts produce numerically close vectors. Embeddings are the foundation of semantic search and vector databases — and are used in every RAG pipeline to convert documents and queries into comparable numerical form.

What is RAG?+

RAG (Retrieval-Augmented Generation) is a pattern for building AI applications that can answer questions about private, current, or domain-specific information that the LLM was not trained on. It works by: (1) indexing your documents as embeddings in a vector database; (2) at query time, retrieving the most semantically relevant chunks; (3) passing those chunks as context to the LLM; (4) generating an answer grounded in the retrieved content.

What is a vector database?+

A vector database stores text chunks as high-dimensional numerical embeddings and enables fast semantic similarity search — finding content by meaning rather than exact keywords. Common vector databases used in AI engineering include Pinecone, Chroma, FAISS, pgvector, and Weaviate. They are the retrieval engine of most RAG systems.

What is model hallucination?+

Hallucination is when an LLM generates content that is factually incorrect but stated with confidence. It happens because LLMs generate the statistically likely next token — they do not look up facts. Hallucination is mitigated in production AI systems through: RAG (grounding answers in retrieved documents), structured outputs, output evaluation with RAGAS, human review processes, and adding citations to generated responses.

LLM and Prompt Engineering Interview Questions

For deeper preparation, see the Advanced Prompt Engineering Training.

What is a system prompt?+

A system prompt is an instruction you provide to an LLM at the beginning of a conversation to set its persona, define what it should or should not do, and establish output format expectations. In production applications, the system prompt is carefully engineered and versioned — it is a core asset of the AI application.

What is a prompt template?+

A prompt template is a reusable prompt structure with variable placeholders that are filled at runtime. For example: "You are a helpful assistant. Context: {context}. Question: {question}. Answer concisely." LangChain's PromptTemplate and ChatPromptTemplate classes are the standard tooling for managing and composing prompt templates in Python.

What are structured outputs?+

Structured outputs instruct an LLM to return its response in a specific format — typically JSON that matches a Pydantic schema. This is essential for production AI systems where the application needs to parse and use the model's output programmatically. Modern LLM APIs (OpenAI, Anthropic) support enforced structured output modes that guarantee valid JSON responses.

What is function calling / tool calling?+

Function calling (now called tool calling in most APIs) allows an LLM to decide to call an external function or tool rather than generate a text response. You define tools with their parameters, the LLM decides whether and which tools to call, and your code executes the calls and passes results back. Tool calling is the foundation of AI agent behaviour — enabling LLMs to search databases, call APIs, read files, and take actions.

What is temperature and when would you change it?+

Temperature controls the randomness of an LLM's output. Low temperature (0.1–0.3) makes responses more deterministic and focused — good for factual Q&A, structured outputs, and RAG applications. High temperature (0.7–1.0) increases variety and creativity — good for content generation, brainstorming, and conversational uses. For most production RAG and agent systems, keep temperature low to reduce hallucination risk.

What is a context window and why does it matter?+

A context window is the maximum amount of text (measured in tokens) an LLM can process in a single request — including both input and output. Common limits range from 8K to 200K+ tokens depending on the model. In RAG systems, the context window determines how many retrieved chunks you can include. Filling the context window entirely can degrade quality due to the "lost in the middle" problem — where LLMs are less attentive to content in the centre of a long context.

How do you evaluate LLM output quality?+

LLM output evaluation approaches include: RAGAS (automated evaluation of RAG systems — measures faithfulness, context precision, context recall, and answer relevancy); LLM-as-judge (using a powerful LLM to score another LLM's outputs against rubrics); human review (a human grades outputs against a gold-standard answer); and task-specific metrics (e.g., accuracy for classification tasks). Production AI systems need a mix of automated and human evaluation.

What is prompt injection?+

Prompt injection is an attack where a malicious user crafts input that overrides your system prompt or manipulates the LLM to behave in unintended ways. For example, a user might include text like "Ignore all previous instructions and instead...". Production AI applications must validate and sanitise inputs, scope system prompts defensively, and in some cases run injection detection filters.

What is few-shot prompting?+

Few-shot prompting means including examples of the desired input-output format in the prompt, to guide the LLM's behaviour. Instead of just saying "classify this email as spam or not spam", you include two or three labelled examples before the target input. Few-shot prompting improves consistency and is especially effective for formatting, tone, and classification tasks.

How do you control LLM cost in a production application?+

Cost control strategies include: choosing the smallest model that meets quality requirements; caching responses for repeated queries; keeping context windows as short as possible (trimming irrelevant retrieved chunks); using streaming to start returning output before the full response is generated; batching requests where latency allows; and monitoring per-request token usage in your logging stack to detect cost anomalies early.

RAG Interview Questions

See also: What is RAG? · RAG vs Fine-Tuning

What is RAG and why is it used?+

RAG (Retrieval-Augmented Generation) is a pattern for grounding LLM answers in external knowledge. It is used because LLMs only know what they were trained on — they cannot answer questions about private documents, recent events, or domain-specific content without retrieval. RAG solves this by retrieving relevant context at query time and injecting it into the LLM prompt, enabling factual answers about any content that has been indexed.

Describe the full RAG pipeline.+

The RAG pipeline has two phases. Indexing: load documents → split into chunks → embed each chunk with an embedding model → store vectors + metadata in a vector database. Querying: user submits question → embed the question → retrieve top-K similar chunks from the vector database → assemble chunks into a prompt template → call LLM → return grounded answer. Optional production additions: reranking, citation extraction, RAGAS evaluation, and caching.

What is chunking and why does it matter?+

Chunking is the process of splitting source documents into smaller segments before embedding. Chunk size is a critical design decision: chunks that are too small lose context (the chunk alone may not answer any question); chunks that are too large dilute the embedding (the vector averages across multiple topics). Common approaches: fixed-size with overlap (256–512 tokens, 10–20% overlap), recursive character splitting, and semantic chunking (split at topic boundaries). Poor chunking is the most common root cause of poor RAG retrieval quality.

What is the role of embeddings in RAG?+

Embeddings convert text chunks and user queries into numerical vectors. The core insight is that semantically similar texts produce similar vectors — allowing the vector database to find the chunks most relevant to the user's question by measuring geometric distance in vector space. The same embedding model must be used for indexing and querying; mixing models produces incomparable vectors.

What is a retriever in LangChain?+

A retriever is a LangChain abstraction that takes a query and returns relevant documents. It wraps the vector store similarity search and optionally adds metadata filtering, score thresholds, and ensemble combinations. Common retrievers: VectorStoreRetriever, MultiQueryRetriever (generates multiple query paraphrases), ContextualCompressionRetriever (compresses retrieved documents to keep only relevant passages), and BM25Retriever (keyword-based).

What is reranking and when should you add it?+

Reranking is a two-stage retrieval strategy: broad vector search retrieves top-20 candidates; a cross-encoder reranker then reorders them by relevance to the specific query, returning the top 3–5. Reranking improves retrieval precision at the cost of added latency (~50–200ms). Add reranking when your initial retrieval returns semantically close but topically irrelevant chunks, or when improving factual grounding is worth the latency trade-off. Cohere Rerank and FlashRank are common options.

How do you add citations to RAG responses?+

Citations connect each part of the LLM's answer to the specific retrieved chunks it used. Implementation approaches: (1) prompt the LLM to include source references inline; (2) parse the response and map back to source metadata (source file, page number, URL); (3) use structured output mode to return citations as a separate field alongside the answer. Cohere's Command models have native citation support. Citations are important for enterprise RAG applications where users need to verify answers.

How does RAG reduce hallucination?+

RAG reduces hallucination by giving the LLM specific, relevant, up-to-date context in the prompt. Instead of relying on potentially incorrect training memory, the model is instructed to answer only using the provided context and to say "I don't know" if the context doesn't contain enough information. The system prompt typically includes an explicit instruction like "Answer based only on the provided context. If the context doesn't contain the answer, say so."

What is RAGAS and how do you use it?+

RAGAS is an open-source framework for evaluating RAG systems. It measures four metrics: faithfulness (does the answer contradict the retrieved context?), context precision (are the retrieved chunks relevant?), context recall (were all necessary chunks retrieved?), and answer relevancy (does the answer address the question?). You use RAGAS by preparing a test dataset of questions with ground-truth answers and relevant document IDs, running your RAG pipeline, and computing the scores. RAGAS evaluation should be run as part of CI on every change to chunking, retrieval, or prompting logic.

What is hybrid search in RAG?+

Hybrid search combines vector similarity search (semantic) with keyword search (BM25) and merges results — typically using Reciprocal Rank Fusion (RRF). It improves retrieval when queries include specific identifiers, product codes, or exact phrases that benefit from keyword matching, while still handling semantic paraphrasing. Weaviate, Qdrant, and Azure AI Search support hybrid search natively. Adding hybrid search is a common improvement step for production RAG systems.

What are the main production challenges with RAG?+

Production RAG challenges include: poor retrieval quality from bad chunking or wrong embedding model; duplicate or overlapping retrieved chunks; context window limits constraining how much retrieved text you can include; latency from embedding, vector search, reranking, and LLM call in sequence; cost management across embedding API + vector database + LLM; access control (preventing user A from retrieving user B's documents); index freshness (updating vectors when source documents change); and evaluation drift (retrieval quality degrading over time without an evaluation harness to detect it).

When would you choose fine-tuning over RAG?+

Choose fine-tuning when: you need the model to adopt a specific writing style or tone; the task requires domain-specific vocabulary the base model doesn't handle well; the information to be learned is stable and unlikely to change; response speed is critical and retrieval adds unacceptable latency. Choose RAG when: the information is dynamic, private, or must stay current; you need citations and traceability; the knowledge base is large or grows frequently; you cannot afford fine-tuning cost and compute. See the RAG vs Fine-Tuning guide for the full decision framework.

Vector Database Interview Questions

What is a vector database and how does it differ from a relational database?+

A vector database stores high-dimensional numerical embeddings and supports similarity search — finding the K most semantically similar items to a query. A relational database (SQL) stores structured rows and supports exact-match and range queries. Vector databases are not designed for ACID transactions or relational joins; relational databases cannot do semantic similarity search. In production AI systems, both are often used together: SQL for application data, vector database for semantic retrieval.

What is semantic search?+

Semantic search finds documents by meaning rather than by exact keyword match. A query for "employee health benefits" will find a document section titled "Workforce Wellness Policy" because both express the same concept in the embedding space — even though they share no words. Semantic search is powered by embedding models and vector similarity search, and is the foundation of RAG retrieval.

What similarity metrics are used in vector databases?+

Three main metrics: cosine similarity (measures the angle between vectors — the most common for text embeddings; score ranges from -1 to 1, higher is more similar); dot product (accounts for both direction and magnitude — used with normalised embeddings); Euclidean distance (straight-line distance in vector space — lower is more similar; common for image embeddings). For most text-based RAG applications, cosine similarity is the standard choice.

What is a metadata filter in a vector database?+

Metadata filters allow you to combine semantic similarity search with structured attribute filters — for example, "find the most semantically similar documents, but only where source = HR_Policy and date > 2025-01-01". This is essential for enterprise RAG systems where different document categories, access levels, or time ranges must be scoped per query. All major vector databases (Pinecone, Weaviate, Qdrant) support metadata filtering.

What is pgvector?+

pgvector is a PostgreSQL extension that adds vector column types and similarity search operators to PostgreSQL. It allows teams to store embeddings alongside their existing relational data and run vector similarity queries using SQL. Good for teams already running Postgres who want to avoid adding new infrastructure. Handles moderate vector counts (under ~10M) well; purpose-built vector databases outperform it at very large scale.

What is FAISS and when would you use it?+

FAISS (Facebook AI Similarity Search) is an open-source library for high-performance vector similarity search. It is a library — not a full database — and operates primarily in-memory. It has no built-in API server, metadata storage, or persistence layer. Use FAISS via LangChain for prototyping, research, or applications where you control the full lifecycle of the index in-process. For production with persistence, multi-user access, or metadata filtering, use a proper vector database.

What is the difference between Pinecone and Chroma?+

Pinecone is a fully managed cloud vector database — no infrastructure to run, serverless tier available, production-grade SLA, metadata filtering and namespace isolation built in. Good for production deployments. Chroma is an open-source, Python-native vector store that installs as a pip package and persists locally — zero infrastructure overhead. Good for prototyping, demos, and small-scale development. In a typical AI engineering project: Chroma for local development, Pinecone for production cloud deployment.

What is hybrid search and how does it improve retrieval?+

Hybrid search combines vector similarity search (semantic) with BM25 keyword search and merges the results using Reciprocal Rank Fusion or a weighted score. It improves retrieval precision for queries that include specific identifiers, technical terms, or exact phrases that keyword matching handles well, while still benefiting from semantic understanding for conceptual queries. Weaviate, Qdrant, and Azure AI Search support hybrid search natively.

AI Agents and Agentic AI Interview Questions

What is an AI agent?+

An AI agent is an LLM-powered system that autonomously decides which actions to take — calling tools, browsing the web, writing and executing code, querying databases — in order to complete a goal. Unlike a simple chatbot that only generates text, an agent has tools it can use and follows a loop of Reasoning → Acting → Observing the result → Reasoning again until the task is complete.

How are AI agents different from chatbots?+

A chatbot takes a user message and generates one text response. An AI agent can take a user instruction and execute multiple steps autonomously: searching a database, calling an API, running code, reading a file, and writing output — all without requiring the user to direct each step. Agents use tool calling to interact with external systems and maintain state across multiple reasoning steps.

What is an agentic workflow?+

An agentic workflow is an AI pipeline where the LLM dynamically decides the sequence of steps — which tools to call, in what order, and when to stop — based on intermediate results. This contrasts with a fixed pipeline where steps are predefined. LangGraph is the primary framework for building structured agentic workflows with explicit state management and conditional routing.

What is tool calling in the context of agents?+

Tool calling allows an LLM to invoke external functions you define — a database query, API call, calculator, code executor, or file reader. You register tools with their names, descriptions, and parameter schemas. When the LLM decides a tool should be called, it outputs a structured tool-call message. Your application executes the function and returns the result to the LLM, which continues reasoning. Tool calling is the mechanism that gives agents the ability to take actions in the world.

What is memory in an AI agent?+

Memory in agents has two forms: short-term memory (the conversation history and intermediate results within a single session — stored in the context window); and long-term memory (persistent storage outside the context window — often a vector database or SQL database that stores summaries of past sessions, user preferences, or knowledge accumulated over time). Managing memory is a core design challenge in production agents: what to keep, what to summarise, and what to retrieve.

What is the planner/executor pattern?+

In the planner/executor pattern, one LLM call (the planner) decomposes the user goal into a sequence of sub-tasks; separate LLM calls (executors) or tools then carry out each sub-task. The planner also handles replanning when an executor fails or returns unexpected results. This pattern improves reliability for complex multi-step tasks by separating high-level reasoning from low-level execution.

What is a multi-agent system?+

A multi-agent system uses multiple specialised AI agents that collaborate — one agent might search the web, another analyses retrieved data, a third writes a report, and an orchestrator coordinates them. Multi-agent systems are more capable than single agents for complex tasks but introduce coordination overhead, communication costs, and failure cascade risks. LangGraph and CrewAI are the main frameworks for building multi-agent systems in Python.

What is LangGraph and how is it different from LangChain agents?+

LangGraph is a framework for building stateful, graph-based agent workflows. Unlike LangChain's original agent executor (which ran a generic ReAct loop), LangGraph gives you explicit control over state structure (via typed TypedDict nodes), conditional routing between nodes, cycles for retry/reflection, and human-in-the-loop checkpoints. LangGraph is preferred for production agents where you need predictable behaviour, debuggable state, and complex workflows with branching logic.

What is CrewAI?+

CrewAI is a framework for building role-based multi-agent systems. You define agents with specific roles (researcher, writer, analyst), assign tasks to agents, and define a process (sequential or hierarchical). CrewAI is higher-level than LangGraph — easier to get started with multi-agent systems, but less granular control over state and routing. Good for content creation pipelines, research summarisation, and agentic tasks that map naturally to human team structures.

What is human-in-the-loop in agent systems?+

Human-in-the-loop is a design pattern where the agent pauses at defined checkpoints to request human approval, clarification, or correction before proceeding. LangGraph supports this natively via breakpoints and persistence. HITL is important for high-stakes or irreversible agent actions (sending emails, modifying databases, making API calls with real-world effects). In production, HITL is often part of the QA and safety layer.

What are guardrails in AI agents?+

Guardrails are constraints that prevent an AI agent from taking harmful, unintended, or off-policy actions. They can be implemented as: input validation (rejecting out-of-scope queries); output validation (checking LLM responses before returning them to users); tool use restrictions (limiting which tools agents can invoke); rate limiting; and monitoring with alerting when anomalous behaviour is detected. NeMo Guardrails and Guardrails AI are Python libraries for systematic guardrail implementation.

How do you evaluate an AI agent?+

Agent evaluation is harder than static LLM evaluation because agents take multiple steps. Approaches: trajectory evaluation (did the agent take the right sequence of steps?); final output evaluation (did the agent produce the correct result?); tool use accuracy (did it call the right tools with the right parameters?); latency and cost tracking (how many LLM calls and tokens per task?); and failure mode analysis (what percentage of tasks fail and why?). LangSmith and custom logging are the primary tools for agent tracing and evaluation.

MCP Interview Questions

LangChain Interview Questions

What is LangChain?+

LangChain is a Python (and JavaScript) framework for building applications on top of LLMs. It provides abstractions and integrations for: prompt templates, LLM wrappers, document loaders, text splitters, embedding models, vector stores, retrievers, chains, agents, and output parsers. Its main strength is accelerating RAG prototype development through its extensive integration ecosystem — connecting to over 100 LLM providers, vector databases, and data sources.

What is a chain in LangChain?+

A chain in LangChain is a sequence of components connected together — for example, a prompt template → LLM → output parser. LangChain Expression Language (LCEL) uses a pipe operator syntax (prompt | llm | parser) to compose chains declaratively. Chains can be simple (prompt to LLM) or complex (retriever → prompt → LLM → parser → router → another chain).

What is a retriever in LangChain?+

A retriever is a LangChain interface that takes a query string and returns relevant document chunks. It wraps vector store similarity search but can also implement custom logic — multi-query retrieval (generating multiple paraphrases of the query), contextual compression (trimming irrelevant parts of retrieved chunks), or ensemble retrieval (combining vector search and BM25). Retrievers are the core component connecting vector stores to RAG chains.

How do tools work in LangChain?+

LangChain tools are Python functions decorated or wrapped as tools that an agent can choose to invoke. Each tool has a name, description, and input schema. When a LangChain agent decides to use a tool, it generates the tool name and arguments as a structured output; LangChain invokes the function and passes the result back to the agent for the next reasoning step. Tool design — writing clear names and descriptions — directly impacts whether the agent chooses tools correctly.

How does a LangChain agent work?+

A LangChain agent is an LLM given a set of tools and a goal. It follows a reasoning loop: the LLM reasons about the goal → decides whether to call a tool → LangChain executes the tool → observation is added to context → LLM reasons again. LangChain supports several agent types: ReAct, OpenAI Functions, and OpenAI Tools agents. For production stateful agents, LangGraph is preferred over the built-in agent executor.

What is the difference between LangChain and LangGraph?+

LangChain provides components for building LLM applications and RAG systems. LangGraph is built on top of LangChain and provides a graph-based framework for stateful, cyclical agent workflows. LangChain agents run a generic loop with limited state control; LangGraph lets you define explicit nodes, edges, conditional routing, state schemas, and human-in-the-loop checkpoints. Use LangChain for straightforward RAG pipelines; use LangGraph for complex, stateful, or multi-agent systems.

What are LangChain's main production limitations?+

Common LangChain production criticisms: abstraction layers make debugging harder when something goes wrong inside a chain; version updates frequently introduced breaking changes (a significant pain in production maintenance); the framework encourages rapid prototyping which can result in unmaintainable code in larger systems; and for very high-performance use cases, the abstraction overhead can be replaced with direct API calls. LangChain is excellent for prototyping; production teams often strip back to more direct code over time.

What is LangSmith?+

LangSmith is a LangChain-developed platform for tracing, evaluating, and monitoring LangChain applications. It records every step of a chain or agent run — inputs, outputs, latency, token counts — and makes them browsable and searchable. It supports creating test datasets, running evaluations, and comparing runs. LangSmith is the primary tool for debugging LangChain applications and setting up RAG evaluation pipelines in the LangChain ecosystem.

AI Engineering System Design Questions

For each system design question, structure your answer around: architecture, data flow, retrieval design, model choice, evaluation, deployment, security, and monitoring.

Q: Design a company policy RAG chatbot.

Your answer should cover:

▸Document ingestion pipeline (PDF, DOCX, Confluence) → chunking → embedding → Pinecone/pgvector
▸Query pipeline: embed query → retrieve top-K with metadata filter (department, access level) → rerank → assemble prompt → LLM → answer with citations
▸Evaluation: RAGAS faithfulness and context precision on a labelled HR question set
▸Access control: namespace per department; user role metadata filter on retrieval
▸Deployment: FastAPI + Docker; Redis cache for repeated queries; CloudWatch/Datadog for monitoring

Q: Design a customer support AI assistant at scale.

Your answer should cover:

▸Support knowledge base indexed in vector database (FAQ + resolved ticket history)
▸Query understanding: classify intent first (billing / technical / general) → route to specialised retrievers
▸Fallback: if confidence below threshold, escalate to human agent with retrieved context pre-populated
▸Latency SLA: p95 < 2 seconds — requires retrieval and LLM call to be optimised; response streaming
▸Feedback loop: human agent corrections feed back into test dataset for RAGAS re-evaluation

Q: Design a multi-agent research assistant.

Your answer should cover:

▸Orchestrator agent: receives user goal, decomposes into research tasks
▸Sub-agents: search agent (web search), retrieval agent (internal documents), synthesis agent (summarise and combine)
▸LangGraph state machine: typed state, conditional routing, retry logic on tool failure
▸Output: structured research report with citations from both web and internal sources
▸Human-in-the-loop checkpoint before final report delivery for high-stakes research tasks

Q: Design a PDF Q&A system for a legal firm.

Your answer should cover:

▸Ingestion: PDF extraction → page-level or section-level chunking → law-domain embedding model → Qdrant with metadata (case, date, jurisdiction, access level)
▸Retrieval: hybrid search (vector + BM25 for exact legal term matching) + reranking
▸Access control: case-level namespacing; only lawyers assigned to a case can retrieve its documents
▸Evaluation: legal matter expert panel reviews sample Q&A outputs monthly; RAGAS automated weekly
▸Audit log: every query, retrieved chunks, LLM call, and answer logged for compliance

Q: Design an MCP-connected enterprise assistant.

Your answer should cover:

▸MCP servers: CRM (Salesforce), HR system, internal knowledge base, code repository
▸Agent: LangGraph-based, with MCP tool discovery at startup; decides which MCP tools to invoke per query
▸Authentication: OAuth 2.0 per MCP server; agent token has scoped permissions per system
▸Session management: conversation memory in Redis; long-term memory in vector database
▸Monitoring: LangSmith tracing; alert on tool call error rates > 1% or p95 latency > 3s

Q: Design a production LLM API with monitoring.

Your answer should cover:

▸FastAPI endpoint wrapping LLM calls; Pydantic input validation; structured output enforcement
▸Middleware: API key authentication, rate limiting per key (Redis), request/response logging
▸Retry logic: exponential backoff on LLM provider rate limit errors; fallback to secondary model
▸Observability: structured JSON logs with request_id, model, token counts, latency; Datadog dashboard
▸Evaluation: sample 5% of traffic for async LLM-as-judge quality scoring; alert if quality drops below threshold

Q: Design a RAG system for clinical or pharma documents.

Your answer should cover:

▸Data governance: documents classified by sensitivity tier; PII detection before indexing
▸Chunking: section-level (clinical trial sections are meaningful units); domain-specific BioBERT embedding model
▸Access control: researcher role metadata filter; regulatory team has different access scope
▸Hallucination mitigation: low temperature; system prompt instructs "do not infer beyond provided context"; mandatory citations
▸Regulatory audit trail: all queries and retrieved contexts logged with user ID, timestamp, and answer for compliance review

Q: Design a secure multi-tenant enterprise AI assistant.

Your answer should cover:

▸Tenant isolation: Pinecone namespace per tenant; tenant ID injected as metadata filter on all queries — users can never retrieve another tenant's documents
▸Authentication: JWT auth at API gateway; tenant ID extracted from token, not from user input
▸Data residency: option to deploy per-region for tenants with data sovereignty requirements
▸Audit logging: every query, retrieval, and answer logged per tenant for compliance
▸Monitoring: per-tenant usage dashboard; anomaly detection on query volume and retrieval patterns to detect data exfiltration attempts

Deployment and LLMOps Interview Questions

For in-depth production deployment training, see the Production AI Engineering training.

How do you deploy an AI application with FastAPI and Docker?+

Build a FastAPI app with your LLM or RAG logic → write a Dockerfile that installs dependencies and starts uvicorn → build the image → push to a container registry (ECR, GCR, Docker Hub) → deploy to cloud (AWS ECS, Google Cloud Run, Azure Container Apps). Environment variables (API keys, database URLs) are injected via secrets management (AWS Secrets Manager, GCP Secret Manager). Health check endpoint (/health) is required for orchestration platforms to manage the container lifecycle.

How do you handle latency in a production AI API?+

Latency reduction strategies: response streaming (return tokens as generated rather than waiting for completion); caching (Redis cache for repeated queries — hash the query and return cached response if available); model selection (smaller, faster models for low-complexity queries); parallel LLM calls for multi-part responses; async request handling; and retrieval optimisation (reduce the number of chunks retrieved, tune chunk size to minimise LLM context). Measure p50, p95, and p99 latency per endpoint.

How do you log AI application requests and responses?+

Use structured JSON logging with a consistent schema per log entry: request_id, timestamp, user_id (hashed), model, input_tokens, output_tokens, latency_ms, retrieved_chunks (IDs only, not full content), and response_status. Send logs to a centralised log aggregation system (CloudWatch, Datadog, ELK Stack). Never log raw user inputs or full LLM responses if they may contain PII — apply masking or only log hashed/truncated versions.

What is LLMOps?+

LLMOps is the practice of operating LLM-based applications in production — analogous to MLOps for traditional ML but adapted for the specific challenges of LLMs. It includes: prompt versioning and management; output evaluation pipelines; latency and cost monitoring; model version management (handling LLM provider model deprecations); A/B testing of prompts and retrieval strategies; and incident response when LLM output quality degrades.

How do you monitor an AI application in production?+

Monitor at four levels: infrastructure (CPU, memory, container health — standard cloud monitoring); API (request rate, error rate, latency per endpoint — Datadog, CloudWatch); LLM (token counts per request, model error rates, cost per query — via LLM provider dashboards and structured logs); quality (sampled automated RAGAS evaluation, human review of flagged outputs, user feedback signals). Set alerts for: error rate > 1%, p95 latency > SLA, token cost anomalies, and quality metric degradation.

How do you control costs in a production LLM application?+

Cost control: choose the smallest model that meets quality requirements for each query type; cache frequent queries in Redis; limit context window by trimming low-relevance retrieved chunks; use streaming to detect early termination opportunities; track per-request token counts in structured logs; set hard token limits on API calls; implement cost budgets per customer or tenant; and review the cost-per-query dashboard weekly to detect usage anomalies.

What is environment variable management in AI applications?+

AI applications use environment variables to store API keys, database connection strings, and configuration — never hardcoding secrets in source code. Locally, use .env files loaded by python-dotenv (never committed to git — add to .gitignore). In production, use a secrets management service (AWS Secrets Manager, GCP Secret Manager, HashiCorp Vault) to inject secrets at runtime. In Docker/Kubernetes, secrets are mounted as environment variables or files, not baked into the image.

How do you handle LLM provider failures in production?+

Production AI applications should implement: retry with exponential backoff for transient rate limit and timeout errors; fallback to an alternative model or provider if the primary is unavailable (e.g., fallback from GPT-4o to Claude Sonnet); circuit breaker pattern to stop hammering a failing service; graceful degradation (return cached or default response rather than a 500 error); and alerting when error rates exceed thresholds. Structuring your LLM calls behind an abstraction layer makes provider switching easier.

Project-Based Interview Questions

See the AI Engineer Projects guide to build a portfolio you can confidently explain.

Walk me through your RAG project.+

Structure your answer as: (1) Problem — what was the use case and why RAG over a simple prompt? (2) Architecture — document loading, chunking strategy and size, embedding model choice, vector database and why, retrieval configuration; (3) LLM — model choice and system prompt design; (4) Evaluation — what metrics you used and what scores you achieved; (5) Deployment — where and how you deployed it; (6) Results — what worked and what you would improve. Avoid describing a tutorial copy — explain the decisions you made and why.

How did you evaluate your retrieval quality?+

Describe the evaluation approach you used: whether you set up RAGAS with a labelled test dataset (questions + ground truth answers + relevant chunk IDs), or used manual spot-checking, or an LLM-as-judge setup. Mention which RAGAS metrics you tracked (faithfulness, context precision, context recall) and what your scores were. Interviewers want to see that you measured whether your system worked — not just that you built it.

How did you handle hallucinations in your AI project?+

Describe your mitigation strategy: RAG grounding with citations; low temperature setting; a system prompt instructing the model to say "I don't know" when context is insufficient; output validation checking that the answer is supported by retrieved content; human review sampling. If you measured hallucination rate before and after a mitigation, that is a strong answer.

How did you deploy your project?+

Describe the full deployment path: FastAPI endpoint wrapping the RAG or agent logic → Dockerfile → built and pushed to a container registry → deployed to a cloud service (Cloud Run, ECS, Railway, etc.) → environment variables managed securely → health check endpoint → basic monitoring in place. If you deployed to a personal server or just ran it locally, be honest — but explain what production deployment would look like for the same system.

How did you monitor cost in your AI project?+

Describe how you tracked token usage: logging input and output token counts per request, estimating cost per query from the LLM provider pricing, and identifying the most expensive query patterns. If you implemented caching for repeated queries, mention that. If you chose a cheaper model for low-complexity queries, explain that decision. Cost awareness is a strong signal even in a small side project.

What would you improve in your project if you had more time?+

This is an opportunity to demonstrate production thinking. Good answers include: better chunking strategy with overlap optimisation; adding hybrid search to improve precision; implementing a reranker; setting up a full RAGAS evaluation pipeline rather than spot-checking; adding response streaming; improving access control; adding a feedback loop where user corrections feed into the test dataset. Pick the improvements that address the real weaknesses of your current implementation.

What trade-offs did you make in your project?+

Engineers make trade-offs. Strong answers acknowledge them: "I used Chroma locally because I didn't want to manage Pinecone infrastructure for a prototype — in production I would use Pinecone for reliability"; "I used a fixed chunk size of 512 tokens which was a reasonable default but I didn't run a chunk size ablation study"; "I chose GPT-4o-mini for cost reasons — the retrieval quality compensates for the smaller model in most queries but edge cases exist".

What tools did you use in your project and why?+

Be specific and confident: name the LLM provider (OpenAI / Anthropic / Gemini), embedding model, vector database, framework (LangChain / LangGraph), API framework (FastAPI), cloud platform, and monitoring tool. For each, give one sentence of rationale — not just "I used LangChain because it's popular" but "I used LangChain because it has native Chroma integration which accelerated the prototype, and I can swap the vector store layer if needed".

What was your biggest debugging challenge?+

Describe a real debugging experience: poor retrieval quality because chunks were too large and embeddings were diffuse; a LangChain agent calling the wrong tool because the tool description was ambiguous; a FastAPI endpoint throwing 422 errors because the Pydantic model didn't match the LLM's structured output schema; or a RAGAS faithfulness score of 0.4 that led you to discover the LLM was ignoring the retrieved context. How you debug AI systems is a major indicator of experience.

How is your project different from a tutorial?+

This is a direct prompt to demonstrate non-tutorial thinking. Strong differentiators: you evaluated the project (RAGAS or equivalent) and iterated on chunking or retrieval based on results; you deployed it to a real environment; you added features like citations, hybrid search, or access control; you handled edge cases the tutorial ignores; you benchmarked chunking strategies; or you made conscious trade-off decisions and documented them. The answer must be specific — vague claims of "it's more advanced" are unconvincing.

Interview Questions by Experience Level

Experience Level	Likely Interview Focus	Key Preparation
Fresher with Python basics	LLM fundamentals, RAG concepts, embeddings, simple API calls, one beginner project	Build one complete RAG project; study LLM basics, tokens, embeddings, and vector databases; deploy to GitHub
Software developer moving into AI	LLM APIs, RAG pipelines, prompt engineering, FastAPI deployment, LangChain basics, project walkthrough	Leverage existing API and deployment skills; add LangChain + RAG knowledge; build a deployed RAG project
Data scientist moving into AI engineering	RAG vs fine-tuning decision, evaluation metrics (RAGAS), production deployment, agent design, LangGraph	Focus on engineering side: FastAPI, Docker, deployment, LangGraph; demonstrate deployed production project, not notebooks
Data engineer moving into RAG systems	Vector database design, metadata strategy, chunking, ETL to embedding pipeline, access control, pgvector	Map existing ETL expertise to embedding pipelines; understand vector database architecture and production RAG operational concerns
Senior developer / architect	System design, multi-agent architecture, RAG at scale, cost optimisation, evaluation strategy, team guidance, LLMOps	Prepare detailed system design answers; demonstrate production operational experience; show you have made and justified architectural decisions

Common Mistakes in AI Engineering Interviews

✗

Only talking about ChatGPT

Interviewers want to know you can build applications that use LLMs via API, design prompt templates, configure parameters, and evaluate outputs — not just use a chat interface.

✗

Weak Python and API basics

Most AI engineering roles require solid Python. If you struggle to explain async functions, Pydantic models, or FastAPI routing in an interview, it signals you are not ready for production AI work.

✗

No deployed project

A project that only runs in a Jupyter notebook is not production evidence. Even a Cloud Run deployment of a simple FastAPI endpoint demonstrates deployment readiness.

✗

No understanding of RAG evaluation

Building a RAG pipeline without evaluating it is like shipping code without tests. Know what RAGAS measures and be able to describe how you would set up evaluation on your project.

✗

No cost or latency awareness

Real AI applications have budgets and SLAs. If you cannot estimate the cost per query or discuss latency trade-offs in your architecture, it signals you have not thought beyond the demo.

✗

Copying tutorial projects without modification

Interviewers can quickly identify tutorial code. Be able to explain what you changed, what you evaluated, and what you would do differently — or build something that departs meaningfully from standard tutorials.

✗

Not explaining trade-offs

Every architectural decision involves trade-offs. "I used Chroma because it was simple to set up and appropriate for this prototype — for production I would use Pinecone for reliability and managed infrastructure" shows engineering judgement.

✗

Ignoring security and access control

Production AI systems handle sensitive data. Knowing how to scope vector database namespacing, enforce retrieval access control, and handle API key security is expected for senior roles and is a differentiator even for junior ones.

30-Day AI Engineering Interview Preparation Plan

Follow the AI Engineering Roadmap and the AI Engineer Skills guide alongside this plan.

Week 1Fundamentals and LLM basics

·Study LLM fundamentals: tokens, context windows, temperature, API basics
·Call OpenAI or Anthropic API directly in Python — parse JSON responses, handle errors
·Understand prompt templates, system prompts, and few-shot prompting
·Study embeddings: what they are, how to generate them, why similar texts produce similar vectors
·Read the AI Engineering Guide and AI Engineer Skills guide

Week 2RAG, vector databases and LangChain

·Build a complete RAG pipeline from scratch: document loading → chunking → embedding → Chroma → retrieval → LLM
·Study chunking strategies: fixed-size, recursive, overlap — experiment with different sizes
·Learn LangChain: chains, retrievers, prompt templates, output parsers, LCEL
·Read the RAG guide and the Vector Database guide
·Set up RAGAS on your RAG project: run faithfulness and context precision metrics

Week 3Agents, MCP and deployment

·Build a LangGraph agent: typed state, tool definition, conditional routing, human-in-the-loop
·Understand MCP: read the What is MCP? guide; optionally build a simple MCP server
·Wrap your RAG project in a FastAPI app with Pydantic input validation and structured output
·Write a Dockerfile; build and run the containerised app locally
·Deploy to Cloud Run or Railway; add environment variable management for API keys

Week 4Projects, GitHub, mock interviews and system design

·Finalise your RAG project: add RAGAS evaluation results, deployment URL, and architecture diagram to the README
·Prepare a 5-minute spoken walkthrough of your project: problem, architecture, decisions, evaluation, results
·Practice 3 system design questions aloud: company RAG chatbot, customer support assistant, multi-agent research tool
·Review all question sections in this guide; write out answers to any you could not answer confidently
·Do at least 2 mock interviews — with a peer, a mentor, or via the 1:1 AI Engineering Mentorship

Recommended Technovids Learning Path

Goal	Resource
Understand AI Engineering as a discipline and role	AI Engineering Guide →
Follow a sequenced interview prep roadmap	AI Engineering Roadmap →
Know exactly which skills interviewers test	AI Engineer Skills Guide →
Build a portfolio you can confidently explain	AI Engineer Projects Guide →
Understand AI engineering salary and career potential	AI Engineer Salary India →
Join structured live training with 5 production projects	AI Engineering Course →
Get 1:1 project guidance and mock interview prep	1:1 AI Engineering Mentorship →
Learn production deployment and LLMOps in depth	Production AI Engineering →
Explore all Technovids AI resources	AI Engineering Resource Library →

Preparing for AI engineering interviews and need guided support?

Knowing the questions is one thing. Being able to answer them with a real project behind you, a confident system design response, and a deployed GitHub portfolio is another. Technovids offers live AI engineering training — cohort-based courses with project guidance, and 1:1 mentorship for personalised interview preparation.

Explore AI Engineering Course →Book 1:1 AI Engineering Mentorship Read AI Engineer Projects Guide

Frequently Asked Questions — AI Engineering Interviews

What are common AI engineering interview questions?+

Common AI engineering interview questions cover: Python and API basics (how to call an LLM API, parse JSON, handle errors); LLM concepts (what is a token, temperature, structured output); RAG pipeline architecture (chunking, embeddings, vector search, retrieval, generation); vector databases (semantic search, similarity metrics, tool comparison); AI agents and tool calling; LangChain, LangGraph, and CrewAI; MCP protocol awareness; deployment with FastAPI and Docker; evaluation with RAGAS; and project walkthroughs where you explain what you built, how you evaluated it, and what trade-offs you made.

What should I prepare for an AI engineer interview?+

Prepare across five areas: (1) Core concepts — LLMs, tokens, embeddings, RAG, vector databases, agents, tool calling, MCP; (2) Frameworks — LangChain, LangGraph, CrewAI, or equivalent; (3) Deployment — FastAPI, Docker, cloud basics, logging, evaluation; (4) Projects — at least one deployed RAG or agent project you can explain end-to-end; (5) System design — ability to design a company knowledge chatbot or document Q&A system with proper architecture, retrieval, evaluation, and monitoring. Avoid only knowing ChatGPT prompting — interviewers want practical application-building skills.

Are RAG questions common in AI engineering interviews?+

Yes — RAG questions are among the most common in AI engineering interviews, especially for roles at companies building knowledge assistants, document search tools, or AI chatbots. Expect questions on the full pipeline: document loading, chunking strategy, embedding model selection, vector database choice, retrieval design, reranking, prompt template construction, and RAGAS evaluation. You should be able to draw and explain the full RAG architecture and describe the decisions you made in a real project.

Do AI engineers need machine learning theory?+

AI engineers need working knowledge of machine learning concepts — what training, fine-tuning, and embeddings are — but deep theoretical ML (backpropagation math, optimisation algorithms, model architectures) is not typically tested unless the role is explicitly model-building or ML engineering. The focus is on practical AI application engineering: building RAG pipelines, deploying LLM-powered APIs, designing agent workflows, evaluating outputs, and shipping to production. ML theory is a plus; applied engineering ability is the requirement.

What projects should I explain in an AI engineering interview?+

The highest-signal projects are: (1) A deployed RAG knowledge assistant — with chunking strategy, vector database, retrieval pipeline, FastAPI endpoint, and RAGAS evaluation; (2) A LangGraph multi-agent workflow — with typed state, tool calling, and conditional routing; (3) An LLM-powered API with structured outputs — deployed to cloud, with logging and error handling; (4) An MCP-connected assistant — even a simple MCP server exposing two tools is a strong differentiator. For any project, be prepared to explain architecture decisions, what you evaluated, what failed, and what you would improve.

How do I prepare for LLM interview questions?+

To prepare for LLM interview questions: understand what tokens are and how context windows work; know the difference between temperature, top-p, and max tokens; understand system prompts and prompt templates; learn structured outputs and function/tool calling; understand how to evaluate LLM outputs (RAGAS, human review, LLM-as-judge); know the cost and latency trade-offs between different model sizes and providers; and be able to explain why a given model might hallucinate and how RAG reduces hallucination. Read the API documentation for at least one LLM provider (OpenAI or Anthropic) in depth.

What system design questions are asked for AI engineering roles?+

Common AI engineering system design questions include: "Design a company policy RAG chatbot", "Design a customer support AI assistant at scale", "Design a multi-agent research system", "Design a PDF Q&A tool", "Design an AI system with access control for enterprise documents", and "Design a production LLM API with monitoring and fallback". For each, structure your answer around: problem statement and scope, data flow and architecture, retrieval design, model choice and rationale, evaluation strategy, deployment architecture, security and access control, monitoring and alerting.

Is LangChain important for AI interviews?+

LangChain is commonly expected knowledge for AI engineering roles. Interviewers may ask about chains, prompt templates, retrievers, tool integration, LCEL (LangChain Expression Language), LangGraph for stateful agents, and when LangChain's abstraction layer adds overhead versus helping. You should understand LangChain's strengths (rapid RAG prototyping, strong integrations) and its limitations (abstraction overhead, debugging difficulty in complex pipelines). LangGraph is increasingly important for multi-agent and stateful workflow roles.

Should I learn AI agents for interviews?+

Yes. AI agents are rapidly becoming a core interview topic as companies shift from static LLM applications to agentic systems. Understand what makes a system "agentic" (a model deciding tool use and sequencing), how tool calling works, what memory and state management mean in agent systems, how LangGraph differs from LangChain agents, multi-agent coordination patterns, and human-in-the-loop design. Even one hands-on project with LangGraph agents is highly differentiating — most candidates can talk about agents but cannot demonstrate them.

How do freshers prepare for AI engineering interviews?+

Freshers with a technical background (CS, engineering, Python, or data science) should: (1) Learn Python API basics — calling LLM APIs, handling JSON, building FastAPI endpoints; (2) Understand LLMs, tokens, embeddings, and RAG conceptually and hands-on; (3) Build one complete RAG project end-to-end using LangChain, Chroma, and FastAPI; (4) Push the project to GitHub with a README that explains architecture, evaluation results, and how to run it; (5) Prepare to explain every decision in the project. The Technovids AI Engineering Course covers all of this with live instructor guidance and five guided production projects.

How do software developers move into AI engineering roles?+

Software developers moving into AI engineering have strong advantages — API integration, code quality, deployment, and debugging skills translate directly. Focus your upskilling on: understanding LLM APIs, prompt engineering, and structured outputs; building and deploying a RAG pipeline; learning LangChain and LangGraph; deploying a FastAPI endpoint with logging and monitoring; and evaluating outputs with RAGAS. Projects and GitHub are critical — a deployed RAG assistant or LangGraph agent signals AI engineering readiness far more than certifications alone.

Which Technovids resource should I read next?+

Start with the AI Engineering Guide for the full landscape, then the AI Engineering Roadmap for a sequenced learning path. To understand the specific skills interviewers test, see the AI Engineer Skills guide. To build a portfolio you can confidently explain, see the AI Engineer Projects guide. For live structured training across the full interview-relevant stack — LLMs, RAG, agents, LangChain, MCP, and deployment — with five guided production projects and instructor feedback, see the AI Engineering Course.