27 Machine Learning Infographics (Translation)
Infographics are crucial for presenting information to an audience in a more digestible form. With their use expanding to many (if not all) professions, such as journalism, science and research, advertising, and business, research on automating the generation of beautiful, user-centric infographics has become a recent focus of the data visualization community.
In this series of posts, we discuss 5 pioneering research papers that focus on automating the generation of beautiful infographics for different types of data.
Many powerful design tools and code libraries currently support generating infographics from data. The list below mentions some of the tools and libraries you might want to check out. However, designing an infographic is not a straightforward process. Creating an engaging piece of art requires expensive labor and is generally very time-consuming. The required skill set is also very diverse, covering every small decision from selecting which topics to highlight all the way to choosing color combinations.
Software Supporting Infographic Generation
- Microsoft PowerPoint (Design Ideas)
- Microsoft Power BI: for developing data visualization dashboards
- Adobe Illustrator
- Tableau
JavaScript packages to help create infographics
- D3.js
- Highcharts.js
Automated Infographics Design
Recent research in information visualization has shown increased interest in automating or semi-automating the complicated process of infographic generation. The main purpose of this research is not to take control away from humans completely, but to develop techniques that support designers in their decision-making. To study this research, we have divided the papers broadly into 5 categories:
- Timeline infographics design automation
- Icon design automation
- Information flow based automation
- Text-based automation
- Image-chart fusion automation
Timeline Infographics Generation [1,2]
As the name suggests, these methods try to automatically design an infographic for time-based data. One approach works directly on the bitmap image of an existing timeline infographic to extract global and local information. Global information includes the orientation, the layout (unified, faceted, segmented, etc.), and the representation type (radial, linear, etc.). Local information concerns the bounding boxes that contain individual pieces of information in the infographic, such as text boxes and icons. These methods use existing convolutional neural networks (CNNs) to draw bounding boxes or segment the infographic for local information, and to predict the global information via classification. Once the template information is extracted, the old content can be replaced with new content to obtain a new infographic automatically.
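To make the extract-then-replace idea more concrete, here is a minimal sketch of how an extracted template might be represented and refilled. The `Slot` and `TimelineTemplate` classes, the `fill_template` function, and the sample values are hypothetical illustrations and are not taken from the papers; a real system would obtain the global fields and bounding boxes from the CNNs described above.

```python
from dataclasses import dataclass, field


@dataclass
class Slot:
    kind: str          # local information: "text" or "icon"
    bbox: tuple        # bounding box as (x, y, width, height) in pixels
    content: str = ""  # what is currently shown in the box


@dataclass
class TimelineTemplate:
    orientation: str     # global information, e.g. "horizontal"
    layout: str          # e.g. "unified", "faceted", "segmented"
    representation: str  # e.g. "linear", "radial"
    slots: list = field(default_factory=list)


def fill_template(template, events):
    """Return a copy of the template whose text slots hold the new events, in order."""
    remaining = iter(events)
    new_slots = []
    for slot in template.slots:
        # Text slots take the next event; other slots (and leftover text slots) are kept.
        content = next(remaining, slot.content) if slot.kind == "text" else slot.content
        new_slots.append(Slot(slot.kind, slot.bbox, content))
    return TimelineTemplate(template.orientation, template.layout,
                            template.representation, new_slots)


# Hypothetical template, as if extracted from an existing infographic.
template = TimelineTemplate(
    orientation="horizontal", layout="unified", representation="linear",
    slots=[Slot("text", (10, 40, 120, 30), "old event A"),
           Slot("icon", (50, 80, 32, 32), "flag.png"),
           Slot("text", (160, 40, 120, 30), "old event B")],
)
filled = fill_template(template, ["1969: Moon landing", "1990: Hubble launched"])
```

The original template is left untouched, so the same extracted layout can be reused for several datasets.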
On the other hand, a visualization dashboard called Timeline Storyteller [2] takes the raw CSV/Excel sheet of timeline data directly and generates infographics that users can later customize according to their design choices. Users can design infographics and animations even with very large time-series datasets and import pictures of their choice into these infographics, as shown in the example. Timeline Storyteller is available to try online.
Icon Design Automation [3]
The next category on our list covers techniques for designing complex (compound) icons. Given an input text, for example "House Cleaning", the task is to come up with a semantically meaningful icon. The problem might look simple: search for an icon for each word in the query, for example one icon for "House" and another for "Cleaning", then combine the two icons into a compound icon. This works for simple queries, but data for semantically labeled icons is scarce. We therefore need ways to extend the existing semantic knowledge of labeled icons to areas that are not so well covered. For this purpose, the well-studied word embeddings from natural language processing can be useful.
Given a query text, for each unigram we find the nearest word that is annotated and associated with an icon in the existing dataset. The icons retrieved for the query unigrams are then ranked on the basis of style compatibility. To measure style compatibility, an embedding vector describing its style is generated for each icon: the closer the style vectors of two icons, the more similar they are in style. A CNN can be trained to generate these style embeddings. This model is trained on an existing dataset of 1,000 human-curated compound icons, where the individual icons inside one compound icon are considered more similar in style to each other than to a differently styled version of the same icon occurring in another compound icon.
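The lookup-and-rank idea above can be sketched with tiny hand-made vectors standing in for real word and style embeddings. Everything here is a hypothetical illustration: the vectors in `WORD_VECS` and `ICON_STYLES`, the icon names, and the functions `nearest_annotated_word` and `most_compatible_pair` are assumptions, and the paper's trained models are not reproduced.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# Made-up 3-d word embeddings; a real system would load pretrained vectors.
WORD_VECS = {
    "house": [0.9, 0.1, 0.0],
    "home": [0.85, 0.2, 0.05],
    "cleaning": [0.1, 0.9, 0.2],
    "broom": [0.15, 0.8, 0.3],
    "car": [0.0, 0.1, 0.95],
}
# Words that have a labeled icon in the (hypothetical) icon dataset.
ANNOTATED = {"home", "broom", "car"}

# Made-up 2-d style embeddings for the icons available per annotated word.
ICON_STYLES = {
    "home": {"home_outline": [1.0, 0.0], "home_filled": [0.0, 1.0]},
    "broom": {"broom_outline": [0.9, 0.1], "broom_filled": [0.2, 0.8]},
}


def nearest_annotated_word(word):
    """Map a query unigram to the most similar word that has an icon."""
    return max(ANNOTATED, key=lambda w: cosine(WORD_VECS[word], WORD_VECS[w]))


def most_compatible_pair(word_a, word_b):
    """Pick the icon pair whose style vectors are closest (Euclidean distance)."""
    pairs = [(math.dist(va, vb), ia, ib)
             for ia, va in ICON_STYLES[word_a].items()
             for ib, vb in ICON_STYLES[word_b].items()]
    return min(pairs)[1:]


query = ["house", "cleaning"]
words = [nearest_annotated_word(w) for w in query]
icons = most_compatible_pair(*words)
```

With these toy numbers, "house" falls back to the annotated word "home" and "cleaning" to "broom", and the two outline-style icons are chosen because their style vectors are the closest pair.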
For the final piece of the jigsaw, once the icons have been filtered by semantics and style compatibility, they are placed according to space compatibility. To calculate space compatibility, the icons from the 1,000 human-curated compound icons are studied to generate a template for each icon (shown in the image above). The template gives an idea of where another icon can be placed relative to the current icon. Using this information, the icons are placed in the template to generate compound icons.
Information Flow Based Automation [4]
The next category focuses on extracting the information flow in infographics.
Given an infographic image, the information flow describes the direction in which visual groups are placed inside that image. Visual groups are the information-carrying segments of an infographic that are repeated to present the full picture. The flow of these visual groups is called the narrative flow.
This paper classifies narrative flow patterns into 12 classes based on the visual groups and their placements studied in a dataset of 13k infographic images. Object detection CNNs were first used to detect the icons and text that make up visual groups inside infographics, and the placements were then studied to generate the information flow diagram.
The paper describes a flow extraction algorithm that groups the bounding boxes detected by the CNN (YOLO) into visual groups based on proximity and size, and then detects the flow of these visual groups to predict the final visual information flow. The system can also perform a reverse selection and classification: the user draws the direction of information flow, and the system fetches relevant infographics with a similar direction of flow. The accompanying image shows the 12 information flow categories discussed above.
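As a rough illustration of the grouping step, the sketch below clusters bounding boxes whose centers are close together and then guesses a coarse flow direction from how the group centroids are spread. It is a simplified stand-in, not the paper's algorithm: it uses proximity only (box size is ignored), it distinguishes just "horizontal" and "vertical" instead of 12 classes, and the names `group_boxes`, `flow_direction`, and the `max_gap` threshold are assumptions.

```python
import math


def center(box):
    """Center point of a box given as (x, y, width, height)."""
    x, y, w, h = box
    return (x + w / 2, y + h / 2)


def group_boxes(boxes, max_gap):
    """Single-link grouping: a box joins every group that has a center within max_gap."""
    groups = []
    for box in boxes:
        near = [g for g in groups
                if any(math.dist(center(box), center(b)) <= max_gap for b in g)]
        merged = [box]
        for g in near:          # merge all nearby groups into one
            merged.extend(g)
            groups.remove(g)
        groups.append(merged)
    return groups


def flow_direction(groups):
    """Coarse flow guess: the axis along which the group centroids spread the most."""
    centroids = []
    for g in groups:
        cs = [center(b) for b in g]
        centroids.append((sum(c[0] for c in cs) / len(cs),
                          sum(c[1] for c in cs) / len(cs)))
    x_spread = max(c[0] for c in centroids) - min(c[0] for c in centroids)
    y_spread = max(c[1] for c in centroids) - min(c[1] for c in centroids)
    return "horizontal" if x_spread >= y_spread else "vertical"


# Hypothetical detections: three icon + text pairs stacked from top to bottom.
boxes = [(10, 10, 20, 20), (35, 10, 60, 20),
         (10, 110, 20, 20), (35, 110, 60, 20),
         (10, 210, 20, 20), (35, 210, 60, 20)]
groups = group_boxes(boxes, max_gap=50)
direction = flow_direction(groups)
```

Here each icon is paired with the text box next to it, giving three visual groups whose centroids differ only vertically.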
This paper also studies the spatial distribution of different elements inside infographics for each of these 12 classes, as shown below.
Text-Based Automation [5]
Another system in this series is Text-to-Viz. Given a statistical statement, this system tries to produce a complete infographic design directly. Unlike other infographic tools, where the user needs to (or can) edit the final design, Text-to-Viz generates well-defined, aesthetic infographics that need no editing. The best use case for this system is a scenario where the user doesn't need a design-rich infographic but wants something simple and quick to present a piece of statistical information in a better way. According to this paper, there are 4 most common types of infographics:
- Statistical-based: Infographics containing charts, pictographs, etc. for presenting statistical information.
- Timeline-based: Presenting timeline information.
- Process-based: Step-by-step presentation of actions.
- Location-based: Showing information on a map.
According to this research, around 50% of infographics are statistical-based, and around 45% of those are proportion-based, so the authors built an automatic infographic generation system only for this set of infographics. The next step was to study the different parts of a proportion-based statement. The accompanying example shows how such a statement is classified and split into pieces of information that are designed separately.
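To give a feel for what splitting a proportion-based statement involves, here is a minimal sketch that uses a regular expression in place of the paper's text analysis. The pattern, the `parse_proportion` function, the field names, and the sample sentence are all hypothetical; the actual system is likely to handle far more phrasings than this.

```python
import re

# Matches phrases such as "More than 40% of students ..." (illustrative pattern only).
PATTERN = re.compile(
    r"(?:(?P<modifier>more than|less than|about|around|nearly)\s+)?"
    r"(?P<number>\d+(?:\.\d+)?)\s*%\s+of\s+(?P<rest>.+)",
    re.IGNORECASE,
)


def parse_proportion(statement):
    """Split a proportion statement into modifier, numeric value, and subject text."""
    match = PATTERN.search(statement)
    if match is None:
        return None  # not a proportion-based statement this sketch understands
    return {
        "modifier": (match.group("modifier") or "").lower(),
        "value": float(match.group("number")) / 100,
        "subject": match.group("rest").rstrip(". "),
    }


parsed = parse_proportion("More than 40% of students like football.")
```

Each extracted piece (the modifier, the number, and the remaining text) could then be styled and positioned separately in a template.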
Next, the design space needs to be divided according to where the different elements are to be placed. The researchers came up with 20 template designs in which different elements can be placed based on the rules given in the paper.
Image Chart Fusion Automation [6]
The last techniques in our list of automatic infographic generation methods design images containing charts, as shown in the above image. A survey of photographic infographics showed which types of charts are frequently used to present data embedded inside images [6]: bar charts (41.2%), pie charts (21.4%), line charts (9.4%), and scatterplots (2.2%). Besides charts, this information can be embedded as a Single Divided Object, where the graphic is divided into smaller parts along a horizontal or vertical axis and the area of each division is based on the ratio of the quantities being compared. Another option is Multiple Resized Objects, where the objects inside an image are sized according to the data they portray. Using the information about how and where the data is represented, researchers generally follow a common pipeline to generate the final infographics.
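The Single Divided Object idea comes down to simple proportional arithmetic, sketched below. The `divide_object` function and the numbers are hypothetical, and the sketch assumes the object has uniform thickness along the chosen axis, so that segment length is proportional to area; an irregular silhouette would need the cut positions adjusted to equalize actual pixel area.

```python
def divide_object(length, quantities):
    """Split an axis of the given length into segments proportional to the quantities."""
    total = sum(quantities)
    segments = []
    start = 0.0
    for q in quantities:
        end = start + length * q / total
        segments.append((start, end))
        start = end
    return segments


# A 200-pixel-wide object split between two quantities in a 3:1 ratio.
segments = divide_object(200, [3, 1])  # [(0.0, 150.0), (150.0, 200.0)]
```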
From a given dataset, relevant variables are selected and the images corresponding to those variables are collected. When the user selects one of these images, charts are generated for the selected variables to be embedded inside the selected image. At this stage, the user can either drag out an area on the image in which to embed the chart, or choose features from the image (e.g. Hough lines) to use as anchors for overlaying the chart.
Overall, it is reasonable to represent trend or timeline data (line charts) with Hough lines, and pie charts, bar charts, etc. with a masking technique. To fine-tune these embeddings, different types of distortion can be calculated for each type of chart. For example, comparing the slope of the Hough lines with that of the line chart gives an estimate of how well the line chart is embedded in the image. These values are used to optimize the fit of the charts on the images to generate aesthetic info-images. Finally, all of this is implemented in an interface where users can apply their domain knowledge or design skills to fine-tune the automatically generated results.
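One simple way to express the slope comparison is as an average angle difference between the chart's segments and the anchoring line, sketched below. This is an assumed metric for illustration, not the one defined in the paper; `slope_distortion` and the sample coordinates are hypothetical, and the sketch assumes all points are ordered left to right so that angles stay between -90 and 90 degrees.

```python
import math


def angle_deg(p, q):
    """Angle of the segment from p to q, in degrees."""
    return math.degrees(math.atan2(q[1] - p[1], q[0] - p[0]))


def slope_distortion(hough_line, chart_points):
    """Mean absolute angle difference between each chart segment and the Hough line."""
    reference = angle_deg(*hough_line)
    diffs = [abs(angle_deg(a, b) - reference)
             for a, b in zip(chart_points, chart_points[1:])]
    return sum(diffs) / len(diffs)


# Hypothetical Hough line detected in the image, as a pair of endpoints.
hough_line = ((0, 0), (10, 10))
aligned = slope_distortion(hough_line, [(0, 0), (5, 5), (10, 10)])  # follows the line
flat = slope_distortion(hough_line, [(0, 0), (10, 0)])              # ignores the line
```

A lower value means the line chart follows the image feature more closely, so an optimizer could adjust the chart's placement or scaling to reduce it.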
Conclusion
We discussed methods for generating infographics from different types of data: timelines, icons, text, and charts. Each of these methods focuses on a particular aspect of infographics, determined by the type of data it tries to represent. The design cues are generally derived from a survey of existing infographics, and that information is then used to automate the process. This is still a new research area with a very promising future. Future research could explore a wider variety of infographics and combine the existing techniques with new ones to create a more holistic, generalized technique to automate or semi-automate the tedious process of infographic generation.
Translated from: https://towardsdatascience.com/information-organization-with-infographics-using-machine-learning-a-survey-54b2169c1f21