100 NLP Interview Questions (Big Tech and Foreign Companies)

CLASSIC NLP

TF-IDF & ML (8)
  1. Write TF-IDF from scratch. (See the sketch at the end of this section.)

  2. What is normalization in TF-IDF?

  3. Why is TF-IDF still worth knowing today, and how can you use it in complex models?

  4. Explain how Naive Bayes works. What can you use it for?

  5. How can SVM be prone to overfitting?

  6. Explain possible methods for text preprocessing (lemmatization and stemming). What algorithms do you know for this, and in what cases would you use them?

  7. What metrics for text similarity do you know?

  8. Explain the difference between cosine similarity and cosine distance. Which of these values can be negative? How would you use them?
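
For question 1, a minimal pure-Python sketch of TF-IDF (the smoothed IDF formula is one of several conventions; the toy corpus and names are illustrative):

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Compute TF-IDF weights for a list of tokenized documents.
    TF  = term count in document / number of terms in document
    IDF = log((1 + N) / (1 + df)) + 1  (a smoothed convention)"""
    n_docs = len(corpus)
    df = Counter()                      # document frequency of each term
    for doc in corpus:
        df.update(set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + f)) + 1 for t, f in df.items()}

    weights = []
    for doc in corpus:
        counts, total = Counter(doc), len(doc)
        weights.append({t: (c / total) * idf[t] for t, c in counts.items()})
    return weights

corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs are friends".split(),
]
for row in tf_idf(corpus):
    print(row)
```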

METRICS (7)
  1. Explain precision and recall in simple words. What would you look at in the absence of the F1 score? (See the worked example at the end of this section.)

  2. In what case would you observe changes in specificity?

  3. When would you look at macro, and when at micro metrics? Why does the weighted metric exist?

  4. What is perplexity? What can we compute it with?

  5. What is the BLEU metric?

  6. Explain the difference between the different types of ROUGE metrics.

  7. What is the difference between BLEU and ROUGE?
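
For question 1, a small worked sketch of precision, recall, and F1 computed directly from counts (the labels below are made up for illustration):

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    # Count true positives, false positives and false negatives for the positive class.
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if p != positive and t == positive)

    precision = tp / (tp + fp) if tp + fp else 0.0  # of what we predicted positive, how much was right
    recall = tp / (tp + fn) if tp + fn else 0.0     # of what was actually positive, how much we found
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = [1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
print(precision_recall_f1(y_true, y_pred))  # (0.75, 0.75, 0.75)
```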

WORD2VEC (9)
  1. Explain how Word2Vec learns. What is the loss function? What is maximized? (See the negative-sampling sketch at the end of this section.)

  2. What methods of obtaining embeddings do you know? When will each be better?

  3. What is the difference between static and contextual embeddings?

  4. What are the two main architectures you know, and which one learns faster?

  5. What is the difference between GloVe, ELMo, fastText, and Word2Vec?

  6. What is negative sampling and why is it needed? What other tricks for Word2Vec do you know, and how can you apply them?

  7. What are dense and sparse embeddings? Provide examples.

  8. Why might the dimensionality of embeddings be important?

  9. What problems can arise when training Word2Vec on short textual data, and how can you deal with them?
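
For questions 1 and 6, a toy numpy sketch of the skip-gram negative-sampling objective that Word2Vec optimizes for a single (center, context) pair; the vocabulary size, dimensionality, and sampled indices are arbitrary illustration values:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 1000, 50

# Word2Vec keeps two embedding tables: center ("input") and context ("output") vectors.
V = rng.normal(scale=0.1, size=(vocab_size, dim))
U = rng.normal(scale=0.1, size=(vocab_size, dim))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(center, context, negatives):
    """Loss for one pair: -log sigma(u_o . v_c) - sum_k log sigma(-u_k . v_c).
    The model maximizes similarity with the observed context word and pushes
    down similarity with a handful of randomly sampled negative words."""
    v_c = V[center]
    positive_term = np.log(sigmoid(U[context] @ v_c))
    negative_term = np.sum(np.log(sigmoid(-U[negatives] @ v_c)))
    return -(positive_term + negative_term)

center, context = 3, 17                      # hypothetical word indices
negatives = rng.integers(0, vocab_size, 5)   # 5 sampled negatives
print(sgns_loss(center, context, negatives))
```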

RNN & CNN (7)
  1. How many training parameters are there in a simple 1-layer RNN? (See the check at the end of this section.)

  2. How does RNN training occur?

  3. What problems exist in RNN?

  4. What types of RNNs do you know? Explain the difference between GRU and LSTM.

  5. What parameters can we tune in such networks? (Stacking, number of layers)

  6. What are vanishing gradients for RNN? How do you solve this problem?

  7. Why use a Convolutional Neural Network in NLP, and how can you use it? How does a CNN compare with the attention paradigm?
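
For question 1, a quick check of the parameter count of a 1-layer RNN, assuming PyTorch's nn.RNN layout (an input-to-hidden matrix, a hidden-to-hidden matrix, and two bias vectors):

```python
import torch.nn as nn

input_size, hidden_size = 300, 128
rnn = nn.RNN(input_size=input_size, hidden_size=hidden_size, num_layers=1)

# Expected: W_ih (H x I) + W_hh (H x H) + b_ih (H) + b_hh (H) = H * (I + H + 2)
expected = hidden_size * (input_size + hidden_size + 2)
actual = sum(p.numel() for p in rnn.parameters())
print(expected, actual)  # 55040 55040
```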

NLP and TRANSFORMERS

ATTENTION AND TRANSFORMER ARCHITECTURE (15)
  1. How do you compute attention? (additional: for what task was it proposed, and why?)

  2. What is the computational complexity of attention? Compare it with the complexity of an RNN.

  3. Compare RNN and attention. In what cases would you use attention, and when an RNN?

  4. Write attention from scratch. (See the sketch at the end of this section.)

  5. Explain masking in attention.

  6. What is the dimensionality of the self-attention matrix?

  7. What is the difference between BERT and GPT in terms of attention calculation?

  8. What is the dimensionality of the embedding layer in the transformer?

  9. Why are embeddings called contextual? How does it work?

  10. What is used in transformers, layer norm or batch norm, and why?

  11. Why do transformers have PreNorm and PostNorm?

  12. Explain the difference between soft and hard (local/global) attention.

  13. Explain multihead attention.

  14. What other types of attention mechanisms do you know? What are the purposes of these modifications?

  15. How does the complexity of self-attention change as the number of heads increases?
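
For question 4, a minimal numpy sketch of scaled dot-product attention with an optional mask; the shapes are illustrative and the multi-head projections are left out:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.
    mask is a boolean array, True where attention is NOT allowed (e.g. causal masking)."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)  # (..., seq_q, seq_k)
    if mask is not None:
        scores = np.where(mask, -1e9, scores)       # masked positions get ~zero weight
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

seq_len, d_model = 5, 16
rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(seq_len, d_model))
causal_mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)  # no attending to future tokens
out, w = scaled_dot_product_attention(Q, K, V, causal_mask)
print(out.shape, w[0])  # (5, 16); the first row attends only to itself
```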

TRANSFORMER MODEL TYPES (7)

  1. Why does BERT largely lag behind RoBERTa, and what can you take from RoBERTa?

  2. What are T5 and BART models? How do they differ?

  3. What are task-agnostic models? Provide examples.

  4. Explain transformer models by comparing BERT, GPT, and T5.

  5. What major problem exists in BERT, GPT, etc., regarding model knowledge? How can this be addressed?

  6. How does a decoder-only model like GPT work during training and inference? What is the difference?

  7. Explain the difference between heads and layers in transformer models.

POSITIONAL ENCODING (6)

  1. Why is positional information lost in the embeddings of attention-based transformer models?

  2. Explain approaches to positional embeddings and their pros and cons. (See the sinusoidal sketch at the end of this section.)

  3. Why can’t we simply add an embedding with the token index?

  4. Why don’t we train positional embeddings?

  5. What is relative and absolute positional encoding?

  6. Explain in detail the working principle of rotary positional embeddings.
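
For questions 2 and 5, a sketch of the classic sinusoidal absolute positional encoding from "Attention Is All You Need"; the maximum length and model dimension below are arbitrary:

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))."""
    positions = np.arange(max_len)[:, None]                  # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                 # (1, d_model / 2)
    angles = positions / np.power(10000.0, dims / d_model)   # (max_len, d_model / 2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positional_encoding(max_len=128, d_model=64)
print(pe.shape)  # (128, 64) -- added to (not concatenated with) the token embeddings
```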

PRETRAINING (4)
  1. How does causal language modeling work? (See the sketch at the end of this section.)

  2. When do we use a pretrained model?

  3. How would you train a transformer from scratch? Explain your pipeline. In what cases would you do this?

  4. What models, besides BERT and GPT, do you know for various pretraining tasks?
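
For question 1, a tiny PyTorch sketch of what causal language modeling optimizes: the target at each position is simply the next token, so the labels are the inputs shifted by one (the token ids and vocabulary size are made up):

```python
import torch
import torch.nn.functional as F

token_ids = torch.tensor([[5, 17, 42, 8, 99, 2]])      # one hypothetical sequence of token ids
inputs = token_ids[:, :-1]                              # the model sees tokens 0 .. n-2
targets = token_ids[:, 1:]                              # and must predict tokens 1 .. n-1

vocab_size = 128
logits = torch.randn(1, inputs.shape[1], vocab_size)    # stand-in for the model's output

# Cross-entropy between the prediction at position t and the true token at position t+1.
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
print(loss.item())
```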

TOKENIZERS (9)
  1. What types of tokenizers do you know? Compare them.

  2. Can you extend a tokenizer? If yes, in what case would you do this? When would you retrain a tokenizer? What needs to be done when adding new tokens?

  3. How do regular tokens differ from special tokens?

  4. Why is lemmatization not used in transformers? And why do we need tokens?

  5. How is a tokenizer trained? Explain using WordPiece and BPE as examples. (See the toy BPE sketch at the end of this section.)

  6. What position does the CLS vector occupy? Why?

  7. What tokenizer is used in BERT, and which one in GPT?

  8. Explain how modern tokenizers handle out-of-vocabulary words.

  9. What does the tokenizer vocabulary size affect? How would you choose it when training a new one?
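
For question 5, the classic toy BPE training loop (essentially the worked example from the Sennrich et al. paper): count adjacent symbol pairs over the corpus and repeatedly merge the most frequent pair.

```python
import re
from collections import defaultdict

def get_pair_stats(vocab):
    """Count adjacent symbol pairs in a {'space-separated symbols': frequency} vocabulary."""
    pairs = defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair with the merged symbol."""
    pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

# Words are pre-split into characters plus an end-of-word marker.
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
for step in range(10):
    pairs = get_pair_stats(vocab)
    best = max(pairs, key=pairs.get)
    vocab = merge_pair(best, vocab)
    print(step, best)  # the learned merge rules, most frequent first
```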

TRAINING (14)
  1. What is class imbalance? How can it be identified? Name all approaches to solving this problem.

  2. Can dropout be used during inference, and why?

  3. What is the difference between the Adam optimizer and AdamW?

  4. How do consumed resources change with gradient accumulation? (See the training-loop sketch at the end of this section.)

  5. How can you optimize resource consumption during training?

  6. What ways of distributed training do you know?

  7. What is textual augmentation? Name all methods you know.

  8. Why is padding less frequently used? What is done instead?

  9. Explain how warm-up works.

  10. Explain the concept of gradient clipping.

  11. How does teacher forcing work? Provide examples.

  12. Why and how should skip connections be used?

  13. What are adapters? Where and how can we use them?

  14. Explain the concepts of metric learning. What approaches do you know?
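
For questions 4 and 10, a PyTorch training-loop sketch showing gradient accumulation (the same effective batch size with less memory per step) and gradient norm clipping; the model, data, and hyperparameters are placeholders:

```python
import torch
from torch import nn

model = nn.Linear(128, 2)                                 # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
loss_fn = nn.CrossEntropyLoss()
accum_steps = 4                                           # 4 micro-batches form one "virtual" batch

# Fake data loader of (inputs, labels) micro-batches.
loader = [(torch.randn(8, 128), torch.randint(0, 2, (8,))) for _ in range(16)]

optimizer.zero_grad()
for step, (x, y) in enumerate(loader, start=1):
    loss = loss_fn(model(x), y) / accum_steps             # scale so gradients average over micro-batches
    loss.backward()                                       # gradients accumulate in .grad
    if step % accum_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
        optimizer.step()
        optimizer.zero_grad()
```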

INFERENCE (4)
  1. What does the temperature in softmax control? What value would you set?

  2. Explain the types of sampling used in generation: top-k and top-p (nucleus) sampling. (See the sketch at the end of this section.)

  3. What is the complexity of beam search, and how does it work?

  4. What is sentence embedding? What are the ways you can obtain it?
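
For questions 1 and 2, a numpy sketch of temperature scaling, top-k filtering, and top-p (nucleus) filtering applied to a vector of logits before sampling; the logits themselves are random placeholders:

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, seed=None):
    rng = np.random.default_rng(seed)
    logits = np.asarray(logits, dtype=float) / temperature  # <1 sharpens, >1 flattens the distribution
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    if top_k is not None:                                   # keep only the k most probable tokens
        cutoff = np.sort(probs)[-top_k]
        probs = np.where(probs >= cutoff, probs, 0.0)
        probs /= probs.sum()

    if top_p is not None:                                   # nucleus: smallest set with cumulative mass >= p
        order = np.argsort(probs)[::-1]
        cumulative = np.cumsum(probs[order])
        kept = order[: np.searchsorted(cumulative, top_p) + 1]
        filtered = np.zeros_like(probs)
        filtered[kept] = probs[kept]
        probs = filtered / filtered.sum()

    return rng.choice(len(probs), p=probs)

logits = np.random.default_rng(0).normal(size=50)           # stand-in for next-token logits over a tiny vocab
print(sample_next_token(logits, temperature=0.7, top_k=10, top_p=0.9, seed=0))
```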

LLM (10)
  1. How does LoRA work? How would you choose its parameters? Imagine that we want to fine-tune a large language model and apply LoRA with a small r, but the model still doesn’t fit in memory. What else can be done? (See the LoRA sketch at the end of this section.)

  2. What is the difference between prefix tuning, p-tuning, and prompt tuning?

  3. Explain the scaling law.

  4. Explain all the stages of LLM training. Which stages can we skip, and in what cases?

  5. How does RAG work? How does it differ from few-shot KNN?

  6. What quantization methods do you know? Can we fine-tune quantized models?

  7. How can you prevent catastrophic forgetting in LLM?

  8. Explain the working principles of the KV cache, Grouped-Query Attention, and Multi-Query Attention.

  9. Explain the technology behind Mixtral. What are its pros and cons?

  10. How are you? How are things going?
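
For question 1, a from-scratch PyTorch sketch of the LoRA mechanism: freeze the pretrained weight W and learn only a low-rank update B·A scaled by alpha/r. This illustrates the idea, not the PEFT library API; the dimensions, r, and alpha below are arbitrary:

```python
import torch
from torch import nn

class LoRALinear(nn.Module):
    """y = W x + (alpha / r) * B A x, with the base layer frozen and only A, B trained."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                        # freeze the pretrained weights
        self.scaling = alpha / r
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at step 0

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768), r=8, alpha=16)
x = torch.randn(4, 768)
print(layer(x).shape)                                      # torch.Size([4, 768])
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)                                           # 2 * 8 * 768 = 12288 trainable parameters
```

If even this does not fit in memory, common next steps are quantizing the frozen base weights (QLoRA-style), gradient checkpointing, or offloading/sharding across devices.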
