The Illustrated Stable Diffusion

  • 1. The components of Stable Diffusion
    • 1.1. Image information creator
    • 1.2. Image Decoder
  • 2. What is Diffusion anyway?
    • 2.1. How does Diffusion work?
    • 2.2. Painting images by removing noise
  • 3. Speed Boost: Diffusion on compressed (latent) data instead of the pixel image
  • 4. The Text Encoder: A Transformer language model
    • 4.1. How is CLIP trained?
  • 5. Feeding text information into the image generation process
    • 5.1. Layers of the Unet Noise predictor without text
    • 5.2. Layers of the Unet Noise predictor with text
  • 6. Conclusion
  • 7. Resources
  • Citation

https://jalammar.github.io/illustrated-stable-diffusion/

This is a gentle introduction to how Stable Diffusion works.

[image]

Stable Diffusion is versatile in that it can be used in a number of different ways. Let’s focus at first on image generation from text only (text2img). The image above shows an example text input and the resulting generated image (The actual complete prompt is here). Aside from text to image, another main way of using it is by making it alter images (so inputs are text + image).
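
As a concrete sketch of the text2img path, here is roughly what it looks like through the Hugging Face diffusers library; the model ID, prompt, and step count are illustrative assumptions, not details from the article:

```python
import torch
from diffusers import StableDiffusionPipeline

# load a pretrained Stable Diffusion v1 pipeline (checkpoint name is an assumption)
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# text2img: the only input is the text prompt
image = pipe("paradise cosmic beach", num_inference_steps=50).images[0]
image.save("beach.png")
```

For the text + image mode, diffusers exposes a separate StableDiffusionImg2ImgPipeline that additionally takes an initial image to alter.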

[image]

1. The components of Stable Diffusion

Stable Diffusion is a system made up of several components and models. It is not one monolithic model.

As we look under the hood, the first observation we can make is that there’s a text-understanding component that translates the text information into a numeric representation that captures the ideas in the text.

[image]

We’re starting with a high-level view and we’ll get into more machine learning details later in this article. However, we can say that this text encoder is a special Transformer language model (technically: the text encoder of a CLIP model). It takes the input text and outputs a list of numbers representing each word/token in the text (a vector per token).
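
We can poke at this encoder directly. A minimal sketch, assuming the transformers library and the CLIP checkpoint that Stable Diffusion v1 is known to use:

```python
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# CLIP pads/truncates every prompt to a fixed length of 77 tokens
tokens = tokenizer("paradise cosmic beach", padding="max_length",
                   max_length=77, return_tensors="pt")
embeddings = text_encoder(tokens.input_ids).last_hidden_state
print(embeddings.shape)  # torch.Size([1, 77, 768]) -- one 768-d vector per token
```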

That information is then presented to the Image Generator, which is composed of a couple of components itself.

[image]

The image generator goes through two stages:

1.1. Image information creator

This component is the secret sauce of Stable Diffusion. It’s where a lot of the performance gain over previous models is achieved.

This component runs for multiple steps to generate image information. This is the steps parameter in Stable Diffusion interfaces and libraries, which often defaults to 50 or 100.

The image information creator works completely in the image information space (or latent space). We’ll talk more about what that means later in the post. This property makes it faster than previous diffusion models that worked in pixel space. In technical terms, this component is made up of a UNet neural network and a scheduling algorithm.

The word “diffusion” describes what happens in this component. It is the step by step processing of information that leads to a high-quality image being generated in the end (by the next component, the image decoder).

[image]
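
In code, this multi-step loop looks roughly like the sketch below, written against the diffusers UNet/scheduler API. It assumes the text_embeddings tensor from the encoder sketch above; the checkpoint and the specific scheduler class are assumptions:

```python
import torch
from diffusers import UNet2DConditionModel, PNDMScheduler

unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet")
scheduler = PNDMScheduler.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="scheduler")

scheduler.set_timesteps(50)                        # the "steps" parameter
latents = torch.randn(1, 4, 64, 64) * scheduler.init_noise_sigma

for t in scheduler.timesteps:
    latent_in = scheduler.scale_model_input(latents, t)
    with torch.no_grad():
        noise_pred = unet(latent_in, t,
                          encoder_hidden_states=text_embeddings).sample
    # the scheduler removes the predicted noise to produce the next latents
    latents = scheduler.step(noise_pred, t, latents).prev_sample
```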

1.2. Image Decoder

The image decoder paints a picture from the information it got from the information creator. It runs only once at the end of the process to produce the final pixel image.

[image]

With this we come to see the three main components (each with its own neural network) that make up Stable Diffusion:

  • ClipText for text encoding

Input: text.
Output: 77 token embedding vectors, each in 768 dimensions.

  • UNet + Scheduler to gradually process/diffuse information in the information (latent) space

Input: text embeddings and a starting multi-dimensional array (structured lists of numbers, also called a tensor) made up of noise.
Output: A processed information array

  • Autoencoder Decoder that paints the final image using the processed information array

Input: The processed information array (dimensions: (4, 64, 64))
Output: The resulting image (dimensions: (3, 512, 512) which are (red/green/blue, width, height))
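
To make those shapes concrete, a hedged sketch of the tensors flowing between the three components (names and values are purely illustrative):

```python
import torch

text_embeddings = torch.randn(1, 77, 768)   # what ClipText outputs
latents = torch.randn(1, 4, 64, 64)         # what UNet + scheduler refine, step by step
image_pixels = torch.randn(1, 3, 512, 512)  # what the autoencoder decoder paints
```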

[image]

2. What is Diffusion anyway?

Diffusion is the process that takes place inside the pink “image information creator” component. Having the token embeddings that represent the input text, and a random starting image information array (these are also called latents), the process produces an information array that the image decoder uses to paint the final image.

[image]

This process happens in a step-by-step fashion. Each step adds more relevant information. To get an intuition of the process, we can inspect the random latents array, and see that it translates to visual noise. Visual inspection in this case is passing it through the image decoder.

[image]
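
A sketch of that inspection, assuming diffusers’ AutoencoderKL as the image decoder (the checkpoint is an assumption; random latents decode to pure visual noise):

```python
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="vae")

latents = torch.randn(1, 4, 64, 64)  # a random latents array
with torch.no_grad():
    # 0.18215 is SD v1's latent scaling factor; the output is (1, 3, 512, 512)
    image = vae.decode(latents / 0.18215).sample
```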

Diffusion happens in multiple steps. Each step operates on an input latents array and produces another latents array that better resembles the input text and all the visual information the model picked up from the images it was trained on.

[image]

We can visualize a set of these latents to see what information gets added at each step.

[image]

The process is quite breathtaking to look at.

[images: the denoising process, step by step]

Something especially fascinating happens between steps 2 and 4 in this case. It’s as if the outline emerges from the noise.

[images]

2.1. How does Diffusion work?

The central idea of generating images with diffusion models relies on the fact that we have powerful computer vision models. Given a large enough dataset, these models can learn complex operations. Diffusion models approach image generation by framing the problem as follows:

Say we have an image. We generate some noise, and add it to the image.

[image]

This can now be considered a training example. We can use this same formula to create lots of training examples to train the central component of our image generation model.

[image]

While this example shows a few noise amounts, ranging from the original image (amount 0, no noise) to total noise (amount 4), we can easily control how much noise to add, and so we can spread the process over tens of steps, creating tens of training examples per image for every image in the training dataset.

[image]
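
A sketch of this noising step in the spirit of the DDPM formulation; the schedule values are DDPM’s published defaults, and the helper name is illustrative:

```python
import torch

betas = torch.linspace(1e-4, 0.02, 1000)        # noise schedule over 1000 steps
alpha_bars = torch.cumprod(1.0 - betas, dim=0)  # cumulative "signal fraction"

def add_noise(x0, t):
    """Mix an image batch x0 with Gaussian noise; t picks the noise amount."""
    noise = torch.randn_like(x0)
    a = alpha_bars[t].view(-1, 1, 1, 1)         # broadcast over (B, C, H, W)
    noisy = a.sqrt() * x0 + (1 - a).sqrt() * noise
    return noisy, noise                          # (training input, training label)
```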

With this dataset, we can train a noise predictor that actually creates images when run in a certain configuration. A training step should look familiar if you’ve had any ML exposure:

[image]
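
A hedged sketch of one such training step, reusing add_noise from the sketch above; noise_predictor is a stand-in for the UNet:

```python
import torch
import torch.nn.functional as F

x0 = torch.randn(8, 3, 64, 64)              # a placeholder batch of training images
t = torch.randint(0, 1000, (x0.shape[0],))  # pick a random noise amount/step
noisy, noise = add_noise(x0, t)

noise_pred = noise_predictor(noisy, t)      # predict the noise in the noisy image
loss = F.mse_loss(noise_pred, noise)        # compare prediction to the actual noise
loss.backward()                             # update the noise predictor
```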

2.2. Painting images by removing noise

Let’s now see how this can generate images.

The trained noise predictor takes a noisy image and the number of the denoising step, and predicts a slice of noise.

[image]

The sampled noise is predicted so that if we subtract it from the image, we get an image that’s closer to the images the model was trained on (not the exact images themselves, but the distribution - the world of pixel arrangements where the sky is usually blue and above the ground, people have two eyes, cats look a certain way - pointy ears and clearly unimpressed).

[image]

If the training dataset was of aesthetically pleasing images (e.g., LAION Aesthetics https://laion.ai/blog/laion-aesthetics/, which Stable Diffusion was trained on), then the resulting image would tend to be aesthetically pleasing. If we train it on images of logos, we end up with a logo-generating model.

[image]

This concludes the description of image generation by diffusion models mostly as described in Denoising Diffusion Probabilistic Models (https://arxiv.org/abs/2006.11239). Now that you have this intuition of diffusion, you know the main components of not only Stable Diffusion, but also Dall-E 2 and Google’s Imagen.

Note that the diffusion process we described so far generates images without using any text data. So if we deploy this model, it would generate great-looking images, but we’d have no way of controlling whether it’s an image of a pyramid, a cat, or anything else. In the next sections we’ll describe how text is incorporated into the process in order to control what type of image the model generates.

3. Speed Boost: Diffusion on compressed (latent) data instead of the pixel image

To speed up the image generation process, the Stable Diffusion paper runs the diffusion process not on the pixel images themselves, but on a compressed version of the image. The paper calls this “Departure to Latent Space” (High-Resolution Image Synthesis with Latent Diffusion Models).

This compression (and later decompression/painting) is done via an autoencoder. The autoencoder compresses the image into the latent space using its encoder, then reconstructs it from only the compressed information using its decoder.

[image]
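
A sketch of this round trip with diffusers’ AutoencoderKL (checkpoint and scaling factor as in the decoder sketch earlier; the input image tensor is a placeholder):

```python
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="vae")

pixels = torch.rand(1, 3, 512, 512) * 2 - 1  # placeholder image scaled to [-1, 1]
with torch.no_grad():
    latents = vae.encode(pixels).latent_dist.sample() * 0.18215  # (1, 4, 64, 64)
    reconstructed = vae.decode(latents / 0.18215).sample         # (1, 3, 512, 512)
```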

Now the forward diffusion process is done on the compressed latents. The slices of noise are noise applied to those latents, not to the pixel image. And so the noise predictor is actually trained to predict noise in the compressed representation (the latent space).

[image]

The forward process (using the autoencoder’s encoder) is how we generate the data to train the noise predictor. Once it’s trained, we can generate images by running the reverse process (using the autoencoder’s decoder).

[image]

These two flows are what’s shown in Figure 3 of the LDM/Stable Diffusion paper:

[image]

This figure additionally shows the “conditioning” components, which in this case is the text prompts describing what image the model should generate. So let’s dig into the text components.

4. The Text Encoder: A Transformer language model

A Transformer language model is used as the language understanding component that takes the text prompt and produces token embeddings. The released Stable Diffusion model uses ClipText (a GPT-based model), while the paper used BERT.

The Illustrated GPT-2 (Visualizing Transformer Language Models)
https://jalammar.github.io/illustrated-gpt2/

The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning)
https://jalammar.github.io/illustrated-bert/

The choice of language model is shown by the Imagen paper to be an important one. Swapping in larger language models had more of an effect on generated image quality than larger image generation components.

Larger / better language models have a significant effect on the quality of image generation models. https://arxiv.org/abs/2205.11487

[image]

The early Stable Diffusion models just plugged in the pre-trained ClipText model released by OpenAI. It’s possible that future models may switch to the newly released and much larger OpenCLIP variants of CLIP (True enough, Stable Diffusion V2 uses OpenClip). This new batch includes text models of sizes up to 354M parameters, as opposed to the 63M parameters in ClipText.

LARGE SCALE OPENCLIP: L/14, H/14 AND G/14 TRAINED ON LAION-2B
https://laion.ai/blog/large-openclip/

Stable Diffusion 2.0 Release
https://stability.ai/news/stable-diffusion-v2-release

4.1. How is CLIP trained?

CLIP is trained on a dataset of images and their captions. Think of a dataset looking like this, only with 400 million images and their captions:

[image]

In actuality, CLIP was trained on images crawled from the web along with their “alt” tags.

CLIP is a combination of an image encoder and a text encoder. Its training process can be simplified as taking an image and its caption and encoding them with the image encoder and the text encoder respectively.

[image]

We then compare the resulting embeddings using cosine similarity. When we begin the training process, the similarity will be low, even if the text describes the image correctly.

[image]

We update the two models so that the next time we embed them, the resulting embeddings are similar.

[image]

By repeating this across the dataset and with large batch sizes, we end up with the encoders being able to produce embeddings where an image of a dog and the sentence “a picture of a dog” are similar. Just like in word2vec, the training process also needs to include negative examples of images and captions that don’t match, and the model needs to assign them low similarity scores.

The Illustrated Word2vec
https://jalammar.github.io/illustrated-word2vec/
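
A minimal sketch of that contrastive objective over a batch: matching image/caption pairs sit on the diagonal of the similarity matrix and should score high, while every off-diagonal pair acts as a negative example (the temperature value is an assumption):

```python
import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, temperature=0.07):
    image_emb = F.normalize(image_emb, dim=-1)  # unit vectors, so the dot
    text_emb = F.normalize(text_emb, dim=-1)    # product is cosine similarity
    logits = image_emb @ text_emb.T / temperature
    targets = torch.arange(len(logits))         # i-th image matches i-th caption
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2
```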

5. Feeding text information into the image generation process

To make text a part of the image generation process, we have to adjust our noise predictor to use the text as an input.

[image]

Our dataset now includes the encoded text. Since we’re operating in the latent space, both the input images and predicted noise are in the latent space.

[image]

To get a better sense of how the text tokens are used in the Unet, let’s look deeper inside the Unet.

5.1. Layers of the Unet Noise predictor without text

Let’s first look at a diffusion Unet that does not use text. Its inputs and outputs would look like this:

[image]

Inside, we see that:

  • The Unet is a series of layers that work on transforming the latents array
  • Each layer operates on the output of the previous layer
  • Some of the outputs are fed (via residual connections) into the processing later in the network
  • The timestep is transformed into a timestep embedding vector, and that’s what gets used in the layers (see the sketch below)

[image]
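
A sketch of that timestep embedding, using the sinusoidal recipe familiar from Transformer positional encodings (the dimension is illustrative):

```python
import torch

def timestep_embedding(t, dim=320):
    """Map integer timesteps t of shape (B,) to embeddings of shape (B, dim)."""
    half = dim // 2
    freqs = torch.exp(-torch.log(torch.tensor(10000.0)) *
                      torch.arange(half) / half)
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)

emb = timestep_embedding(torch.tensor([10, 500]))  # (2, 320)
```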

5.2. Layers of the Unet Noise predictor with text

Let’s now look at how to alter this system to include attention to the text.

[image]

The main change we need to make to the system to support text inputs (technical term: text conditioning) is to add an attention layer between the ResNet blocks.

[image]

Note that the ResNet block doesn’t directly look at the text. But the attention layers merge those text representations into the latents. And now the next ResNet can utilize that incorporated text information in its processing.
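
A hand-rolled sketch of that attention layer (not Stable Diffusion’s exact module): the queries come from the image latents, while the keys and values come from the text embeddings, which is how text information flows into the latents:

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    def __init__(self, latent_dim=320, text_dim=768):
        super().__init__()
        self.to_q = nn.Linear(latent_dim, latent_dim)  # queries from latents
        self.to_k = nn.Linear(text_dim, latent_dim)    # keys from text
        self.to_v = nn.Linear(text_dim, latent_dim)    # values from text

    def forward(self, latents, text_emb):  # (B, N, 320), (B, 77, 768)
        q, k, v = self.to_q(latents), self.to_k(text_emb), self.to_v(text_emb)
        attn = torch.softmax(q @ k.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)
        return attn @ v                    # text-informed latents, (B, N, 320)
```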

6. Conclusion

I hope this gives you a good first intuition about how Stable Diffusion works. Lots of other concepts are involved, but I believe they’re easier to understand once you’re familiar with the building blocks above.

7. Resources

DreamStudio, https://beta.dreamstudio.ai/generate
The Annotated Diffusion Model, https://huggingface.co/blog/annotated-diffusion
What are Diffusion Models? https://lilianweng.github.io/posts/2021-07-11-diffusion-models/
Stable Diffusion with 🧨 Diffusers, https://huggingface.co/blog/stable_diffusion

Citation

https://jalammar.github.io/illustrated-stable-diffusion/
