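Generating Maintainable Front-End Code with Machine Learning (Source Code Included)

Within three years, deep learning will change front-end development. It will speed up prototyping and lower the barrier to building software.

Last year, Tony Beltramelli published the pix2code paper and Airbnb launched sketch2code.

Currently, the largest barrier to automating front-end development is computing power. That does not stop us from using current deep learning algorithms, together with synthesized training data, to start exploring automated front-end development right now.

In this post, we will teach a neural network how to code a basic HTML and CSS website based on a picture of a design mockup. Here are the main steps:

1) Give a design image to the trained neural network

2) The neural network converts the image into HTML markup

3) Render the output

We will build the neural network in three iterations.

In the first version, we'll build a bare-minimum model to get the hang of the moving parts. The second version, HTML, will focus on automating all the steps and explaining the neural network layers. In the final version, Bootstrap, we'll create a model that can generalize and explore the LSTM layer.

All the code is on GitHub and in Jupyter notebooks on FloydHub. All the FloydHub notebooks are in the floydhub directory, and the local equivalents are in local.

The models are based on Beltramelli's pix2code paper and Jason Brownlee's image caption tutorials. The code is written in Python and Keras, a framework on top of TensorFlow.

If you're new to deep learning, I recommend getting a feel for Python, backpropagation, and convolutional neural networks first. These three posts on the FloydHub blog will get you started: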

https://blog.floydhub.com/my-first-weekend-of-deep-learning/

https://blog.floydhub.com/coding-the-history-of-deep-learning/

https://blog.floydhub.com/colorizing-b&w-photos-with-neural-networks/

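Core Logic

Let's recap our goal. We want to build a neural network that generates the HTML/CSS markup corresponding to a screenshot.

When you train the neural network, you give it several screenshots with matching HTML.

It learns by predicting all the matching HTML markup tags one by one. When it predicts the next markup tag, it receives the screenshot as well as all the correct markup tags up to that point.

A simple training data example is available in a Google Sheet.

Throughout this tutorial we use a model that predicts word by word, which is the most common approach. There are other approaches as well.

Notice that for each prediction it gets the same screenshot. So if it has to predict 20 words, it gets the same design mockup twenty times. For now, don't worry about how the neural network works. Focus on grasping its input and output.

Let's focus on the previous markup. Say we train the network to predict the sentence "I can code". When it receives "I", it predicts "can". Next time it receives "I can" and predicts "code". It receives all the previous words and only has to predict the next word.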
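The "previous words in, next word out" scheme can be sketched in a few lines. This is only an illustration of how the training pairs are laid out; the helper name make_training_pairs is made up for this example and is not part of the project code.

```python
def make_training_pairs(tokens):
    """Return (previous tokens, next token) pairs for one token sequence."""
    pairs = []
    for i in range(1, len(tokens)):
        # Everything before position i is the input, token i is the target.
        pairs.append((tokens[:i], tokens[i]))
    return pairs


pairs = make_training_pairs(["I", "can", "code"])
for context, target in pairs:
    print(context, "->", target)
```

Every pair would be fed to the network together with the same screenshot.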

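The neural network creates features from the data. It builds these features to link the input data with the output data. It has to create representations to understand what is in each screenshot and in the HTML syntax it has predicted so far. This builds the knowledge to predict the next tag.

Using the trained model in practice is similar to training it. The text is generated one token at a time, with the same screenshot each time. Instead of being fed the correct HTML tags, the network receives the markup it has generated so far and then predicts the next markup tag. The prediction is initiated with a "start tag" and stops when it predicts an "end tag" or reaches a maximum limit. There is another example in a Google Sheet.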
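The generation loop described above can be sketched as follows. The trained network is replaced by a hypothetical stand-in function (fake_model) that replays a fixed script, so the example only demonstrates the control flow: start token, feed back the model's own output, stop at the end token or at a length limit.

```python
def generate(predict_next, screenshot, max_tokens=10):
    """Greedy decoding: feed the model its own output until it emits 'end'."""
    tokens = ["start"]
    while len(tokens) < max_tokens:
        next_token = predict_next(screenshot, tokens)
        tokens.append(next_token)
        if next_token == "end":
            break
    return tokens


# Hypothetical stand-in for a trained network: it replays a fixed script.
def fake_model(screenshot, previous_tokens):
    script = ["start", "<html>", "Hello", "World!", "</html>", "end"]
    return script[len(previous_tokens)]


result = generate(fake_model, screenshot=None)
print(" ".join(result))
```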

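Hello World Version

Let's build a hello world version. We'll feed a neural network a screenshot of a website displaying "Hello World!" and teach it to generate the markup.

First, the neural network maps the design mockup into a list of pixel values from 0 to 255 in three channels: red, blue, and green.

To represent the markup in a way that the neural network understands, I use one-hot encoding (https://machinelearningmastery.com/how-to-one-hot-encode-sequence-data-in-python/). Thus, the sentence "I can code" could be mapped as shown below.

In the graphic above, we include the start and end tags. These tags are cues for when the network starts its predictions and when it stops.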
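A minimal sketch of that one-hot mapping, assuming a five-word vocabulary made of the start tag, the three words of the sentence, and the end tag. The function name and the vocabulary order are illustrative choices, not taken from the project.

```python
def one_hot(tokens, vocabulary):
    """Map each token to a vector with a single 1 at its vocabulary index."""
    index = {word: i for i, word in enumerate(vocabulary)}
    vectors = []
    for token in tokens:
        vector = [0] * len(vocabulary)
        vector[index[token]] = 1
        vectors.append(vector)
    return vectors


vocabulary = ["start", "I", "can", "code", "end"]
encoded = one_hot(["start", "I", "can", "code", "end"], vocabulary)
for word, vector in zip(vocabulary, encoded):
    print(word, vector)
```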

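For the input data, we will use sentences, starting with the first word and then adding each word one by one. The output data is always one word.

Sentences follow the same logic as words. They also need the same input length. Instead of being limited by the vocabulary, they are bound by a maximum sentence length. If a sentence is shorter than the maximum length, you fill it up with empty words, each represented by a vector of zeros.

As you can see, words are printed from right to left. This forces each word to change position in each training round, which allows the model to learn the sequence instead of memorizing the position of each word.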
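The right-aligned padding can be sketched like this. It assumes token id 0 is reserved for the empty word, and it mirrors the default behaviour of Keras pad_sequences (padding and truncating both happen at the front); pad_left itself is a hypothetical helper written for this example.

```python
def pad_left(sequence, max_length, empty=0):
    """Left-pad (or truncate from the left) to exactly max_length items."""
    trimmed = sequence[-max_length:]
    return [empty] * (max_length - len(trimmed)) + trimmed


# Token ids: 0 is reserved for the empty word.
sentence = [1, 2, 3]  # e.g. start, I, can
inputs = [pad_left(sentence[:i], max_length=3) for i in range(1, len(sentence) + 1)]
for row in inputs:
    print(row)
```

Each new word enters on the right and pushes the earlier words one slot to the left, which is why a word's position changes from one training round to the next.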

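In the graphic below there are four predictions. Each row is one prediction. To the left are the images, represented in their three color channels (red, green, and blue), and the previous words. Outside the brackets are the predictions one by one, ending with a red square to mark the end.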

    from keras.layers import RepeatVector, LSTM, concatenate, Input, Dense
    from keras.preprocessing.image import img_to_array, load_img
    from keras.applications.vgg16 import VGG16, preprocess_input
    from keras.models import Model
    import numpy as np

    # Length of longest sentence
    max_caption_len = 3
    # Size of vocabulary
    vocab_size = 3

    # Load one screenshot for each word and turn them into digits
    images = []
    for i in range(2):
        images.append(img_to_array(load_img('screenshot.jpg', target_size=(224, 224))))
    images = np.array(images, dtype=float)
    # Preprocess input for the VGG16 model
    images = preprocess_input(images)

    # Turn start tokens into one-hot encoding
    html_input = np.array(
        [[[0., 0., 0.],   # start
          [0., 0., 0.],
          [1., 0., 0.]],
         [[0., 0., 0.],   # start <HTML>Hello World!</HTML>
          [1., 0., 0.],
          [0., 1., 0.]]])

    # Turn next word into one-hot encoding
    next_words = np.array(
        [[0., 1., 0.],    # <HTML>Hello World!</HTML>
         [0., 0., 1.]])   # end

    # Load the VGG16 model trained on imagenet and output the classification feature
    VGG = VGG16(weights='imagenet', include_top=True)
    # Extract the features from the image
    features = VGG.predict(images)

    # Load the feature to the network, apply a dense layer, and repeat the vector
    vgg_feature = Input(shape=(1000,))
    vgg_feature_dense = Dense(5)(vgg_feature)
    vgg_feature_repeat = RepeatVector(max_caption_len)(vgg_feature_dense)
    # Extract information from the input sequence
    # (shape is sentence length x vocabulary size; both are 3 here)
    language_input = Input(shape=(max_caption_len, vocab_size))
    language_model = LSTM(5, return_sequences=True)(language_input)

    # Concatenate the information from the image and the input
    decoder = concatenate([vgg_feature_repeat, language_model])
    # Extract information from the concatenated output
    decoder = LSTM(5, return_sequences=False)(decoder)
    # Predict which word comes next
    decoder_output = Dense(vocab_size, activation='softmax')(decoder)
    # Compile and run the neural network
    model = Model(inputs=[vgg_feature, language_input], outputs=decoder_output)
    model.compile(loss='categorical_crossentropy', optimizer='rmsprop')

    # Train the neural network
    model.fit([features, html_input], next_words, batch_size=2, shuffle=False, epochs=1000)

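In the hello world version, we use three tokens: "start", "Hello World!", and "end". A token can be anything. It can be a character, word, or sentence. Character-level tokens require a smaller vocabulary but constrain the neural network. Word-level tokens tend to perform best.

Here we make the prediction: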

    # Create an empty sentence and insert the start token
    sentence = np.zeros((1, 3, 3))  # [[0,0,0], [0,0,0], [0,0,0]]
    start_token = [1., 0., 0.]      # start
    sentence[0][2] = start_token    # place start in empty sentence

    # Making the first prediction with the start token
    second_word = model.predict([np.array([features[1]]), sentence])

    # Put the second word in the sentence and make the final prediction
    sentence[0][1] = start_token
    sentence[0][2] = np.round(second_word)
    third_word = model.predict([np.array([features[1]]), sentence])

    # Place the start token and our two predictions in the sentence
    sentence[0][0] = start_token
    sentence[0][1] = np.round(second_word)
    sentence[0][2] = np.round(third_word)

    # Transform our one-hot predictions into the final tokens
    vocabulary = ["start", "<HTML><center><H1>Hello World!</H1></center></HTML>", "end"]
    for i in sentence[0]:
        print(vocabulary[np.argmax(i)], end=' ')

Output

10 epochs: start start start
100 epochs: start <HTML><center><H1>Hello World!</H1></center></HTML> <HTML><center><H1>Hello World!</H1></center></HTML>
300 epochs: start <HTML><center><H1>Hello World!</H1></center></HTML> end

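Mistakes I made:

- Build the first working version before gathering the data. Early on in this project, I managed to get a copy of an old archive of the Geocities hosting website. It had 38 million websites. Blinded by the potential, I ignored the huge workload required to reduce its 100K-sized vocabulary.
- Dealing with a terabyte of data requires good hardware or a lot of patience. After my Mac ran into several problems, I ended up using a powerful remote server. Expect to rent a rig with 8 modern CPU cores and a 1 Gbps internet connection to have a decent workflow.
- Nothing made sense until I understood the input and output data. The input, X, is one screenshot and the previous markup tags. The output, Y, is the next markup tag. Once I got this, it became easier to understand everything between them. It also became easier to experiment with different architectures.
- Be aware of the rabbit holes. Because this project intersects with many fields in deep learning, I got stuck in plenty of rabbit holes along the way. I spent a week programming RNNs from scratch, got too fascinated by embedding vector spaces, and was seduced by exotic implementations.
- Picture-to-code networks are image caption models in disguise. Even after I learned this, I still ignored many of the image caption papers, simply because they were less cool. Once I got some perspective, I accelerated my learning of the problem space.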

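Running the Code on FloydHub

FloydHub is a training platform for deep learning. I came across it when I first started learning deep learning, and I have used it since for training and managing my deep learning experiments. You can install it and run your first model within 10 minutes. In my experience, it is the best option for running models on cloud GPUs.

If you are new to FloydHub, start with their 2-min installation or my 5-minute walkthrough.

Clone the repository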

git clone https://github.com/emilwallner/Screenshot-to-code-in-Keras.git

Log in and initiate the FloydHub command-line tool

cd Screenshot-to-code-in-Keras
floyd login
floyd init s2c

Run a Jupyter notebook on a FloydHub cloud GPU machine:

floyd run --gpu --env tensorflow-1.4 --data emilwallner/datasets/imagetocode/2:data --mode jupyter

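All the notebooks are in the floydhub directory. The local equivalents are in local. Once it's running, you can find the first notebook here: floydhub/Helloworld/helloworld.ipynb.

If you want more detailed instructions and an explanation of the flags, check my earlier post.

HTML Version

In this version, we'll automate many of the steps from the Hello World model. This section focuses on creating a scalable implementation and on the moving pieces in the neural network.

This version will not be able to predict HTML from random websites, but it is still a good setup for exploring the dynamics of the problem.

Overview

If we expand the components of the previous graphic, it looks like this.

There are two major sections. First, the encoder. This is where we create the image features and the previous markup features. Features are the building blocks that the network creates to connect the design mockups with the markup. At the end of the encoder, we glue the image features to each word in the previous markup.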

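The decoder then takes the combined design and markup feature and creates a next-tag feature. This feature is run through a fully connected neural network to predict the next tag.

Design mockup features

Since we need to insert one screenshot for each word, this becomes a bottleneck when training the network (example: https://docs.google.com/spreadsheets/d/1xXwarcQZAHluorveZsACtXRdmNFbwGtN3WMNhcTdEyQ/edit#gid=0). Instead of using the images, we extract the information we need to generate the markup.

The information is encoded into image features. This is done with an already pre-trained convolutional neural network (CNN). The model is pre-trained on ImageNet.

We extract the features from the layer before the final classification.

We end up with 1536 eight-by-eight pixel images known as features. Although they are hard for us to understand, a neural network can extract the objects and positions of the elements from these features.

Markup features

In the hello world version, we used one-hot encoding to represent the markup. In this version, we'll use a word embedding for the input and keep the one-hot encoding for the output.

The way we structure each sentence stays the same, but how we map each token changes. One-hot encoding treats each word as an isolated unit. Instead, we convert each word in the input data to a list of digits. These represent the relationships between the markup tags.

The dimension of this word embedding is eight, but it often varies between 50 and 500 depending on the size of the vocabulary.

The eight digits for each word are weights, similar to those in an ordinary neural network. They are tuned to map how the words relate to each other (Mikolov et al., 2013).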
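To make the idea of an embedding concrete, here is a minimal sketch of a lookup table with eight numbers per token. The values are random placeholders; in a real model they are weights that get tuned during training. The helper names build_embedding and embed are invented for this illustration.

```python
import random


def build_embedding(vocab_size, dim, seed=0):
    """Create a table with one row of `dim` small random weights per token id."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)] for _ in range(vocab_size)]


def embed(token_ids, table):
    """Replace every token id with its row in the embedding table."""
    return [table[i] for i in token_ids]


table = build_embedding(vocab_size=4, dim=8)
vectors = embed([0, 1, 3], table)
print(len(vectors), "tokens,", len(vectors[0]), "numbers each")
```

Unlike a one-hot vector, two rows in this table can end up close to each other, which is how related tags come to share structure.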

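This is how we start developing markup features. Features are what the neural network develops to link the input data with the output data. For now, don't worry about what they are; we'll dig deeper into this in the next section.

The Encoder

We take the word embeddings and run them through an LSTM, which returns a sequence of markup features. These are run through a TimeDistributed dense layer; think of it as a dense layer with multiple inputs and outputs.

In parallel, the image features are first flattened. Regardless of how the digits were structured, they are transformed into one large list of numbers. Then we apply a dense layer to this list to form a high-level feature. These image features are then concatenated to the markup features.

This can be hard to wrap your mind around, so let's break it down.

Markup features

Here we run the word embeddings through the LSTM layer. In this graphic, all the sentences are padded to reach the maximum size of three tokens.

To mix signals and find higher-level patterns, we apply a TimeDistributed dense layer to the markup features. A TimeDistributed dense layer is the same as a dense layer, but with multiple inputs and outputs.

Image features

In parallel, we prepare the images. We take all the mini image features and transform them into one long list. The information is not changed, just reorganized.

Again, to mix signals and extract higher-level notions, we apply a dense layer. Since we are only dealing with one input value, we can use a normal dense layer. To connect the image features to the markup features, we copy the image features.

In this case, we have three markup features. Thus, we end up with an equal number of image features and markup features.

Concatenating the image and markup features

All the sentences are padded to create three markup features. Since we have prepared the image features, we can now add one image feature to each markup feature.

After sticking one image feature to each markup feature, we end up with three image-markup features. This is the input we feed into the decoder.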
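The copy-and-glue step can be sketched with plain lists. The numbers are toy values and the two helpers only imitate what the RepeatVector and concatenate layers do in the Keras model; real features are much longer.

```python
def repeat_vector(vector, times):
    """Copy one feature vector so there is one copy per timestep."""
    return [list(vector) for _ in range(times)]


def concatenate(image_steps, markup_steps):
    """Glue the image feature onto the markup feature at every timestep."""
    return [img + mk for img, mk in zip(image_steps, markup_steps)]


image_feature = [0.9, 0.1]            # toy 2-number image feature
markup_features = [[0.0, 0.0, 0.0],   # padding
                   [0.2, 0.4, 0.6],   # start
                   [0.1, 0.3, 0.5]]   # next token
combined = concatenate(repeat_vector(image_feature, len(markup_features)), markup_features)
for step in combined:
    print(step)
```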

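The Decoder

Here we use the combined image-markup features to predict the next tag.

In the example below, we use three image-markup feature pairs and output one next-tag feature.

Note that the LSTM layer has return_sequences set to False. Instead of returning a feature for every step of the input sequence, it only predicts one feature. In our case, it is a feature for the next tag. It contains the information for the final prediction.

The final prediction

The dense layer works like a traditional feedforward neural network. It connects the 512 digits in the next-tag feature with the 4 final predictions. Say we have 4 words in our vocabulary: start, hello, world, and end.

The vocabulary prediction could be [0.1, 0.1, 0.1, 0.7]. The softmax activation in the dense layer distributes probabilities from 0 to 1, with the sum of all predictions equal to 1. In this case, it predicts that the 4th word is the next tag. Then you translate the one-hot encoding [0, 0, 0, 1] into the mapped value, say "end".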
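The softmax step can be reproduced with the standard library. The raw scores below are chosen purely so that the result matches the [0.1, 0.1, 0.1, 0.7] example; they are not values from a trained model.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


vocabulary = ["start", "hello", "world", "end"]
probabilities = softmax([0.0, 0.0, 0.0, math.log(7.0)])
predicted = vocabulary[probabilities.index(max(probabilities))]
print([round(p, 3) for p in probabilities], "->", predicted)
```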

    from os import listdir
    import numpy as np
    from keras.preprocessing.text import Tokenizer
    from keras.preprocessing.sequence import pad_sequences
    from keras.preprocessing.image import img_to_array, load_img
    from keras.applications.inception_resnet_v2 import InceptionResNetV2, preprocess_input
    from keras.models import Model
    from keras.utils import to_categorical
    from keras.layers import Embedding, TimeDistributed, RepeatVector, LSTM, concatenate, Input, Dense, Flatten

    # Load the images and preprocess them for inception-resnet
    images = []
    all_filenames = listdir('images/')
    all_filenames.sort()
    for filename in all_filenames:
        images.append(img_to_array(load_img('images/'+filename, target_size=(299, 299))))
    images = np.array(images, dtype=float)
    images = preprocess_input(images)

    # Run the images through inception-resnet and extract the features without the classification layer
    IR2 = InceptionResNetV2(weights='imagenet', include_top=False)
    features = IR2.predict(images)

    # We will cap each input sequence to 100 tokens
    max_caption_len = 100
    # Initialize the function that will create our vocabulary
    tokenizer = Tokenizer(filters='', split=" ", lower=False)

    # Read a document and return a string
    def load_doc(filename):
        with open(filename, 'r') as file:
            return file.read()

    # Load all the HTML files
    X = []
    all_filenames = listdir('html/')
    all_filenames.sort()
    for filename in all_filenames:
        X.append(load_doc('html/'+filename))

    # Create the vocabulary from the html files
    tokenizer.fit_on_texts(X)

    # Add +1 to leave space for empty words
    vocab_size = len(tokenizer.word_index) + 1
    # Translate each word in text file to the matching vocabulary index
    sequences = tokenizer.texts_to_sequences(X)
    # The longest HTML file
    max_length = max(len(s) for s in sequences)

    # Initialize our final input to the model
    X, y, image_data = list(), list(), list()
    for img_no, seq in enumerate(sequences):
        for i in range(1, len(seq)):
            # Add the entire sequence to the input and only keep the next word for the output
            in_seq, out_seq = seq[:i], seq[i]
            # If the sentence is shorter than max_length, fill it up with empty words
            in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
            # Map the output to one-hot encoding
            out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
            # Add the image corresponding to the HTML file
            image_data.append(features[img_no])
            # Cut the input sentence to 100 tokens, and add it to the input data
            X.append(in_seq[-100:])
            y.append(out_seq)

    X, y, image_data = np.array(X), np.array(y), np.array(image_data)

    # Create the encoder
    image_features = Input(shape=(8, 8, 1536,))
    image_flat = Flatten()(image_features)
    image_flat = Dense(128, activation='relu')(image_flat)
    ir2_out = RepeatVector(max_caption_len)(image_flat)

    language_input = Input(shape=(max_caption_len,))
    language_model = Embedding(vocab_size, 200, input_length=max_caption_len)(language_input)
    language_model = LSTM(256, return_sequences=True)(language_model)
    language_model = LSTM(256, return_sequences=True)(language_model)
    language_model = TimeDistributed(Dense(128, activation='relu'))(language_model)

    # Create the decoder
    decoder = concatenate([ir2_out, language_model])
    decoder = LSTM(512, return_sequences=False)(decoder)
    decoder_output = Dense(vocab_size, activation='softmax')(decoder)

    # Compile the model
    model = Model(inputs=[image_features, language_input], outputs=decoder_output)
    model.compile(loss='categorical_crossentropy', optimizer='rmsprop')

    # Train the neural network
    model.fit([image_data, X], y, batch_size=64, shuffle=False, epochs=2)

    # Map an integer to a word
    def word_for_id(integer, tokenizer):
        for word, index in tokenizer.word_index.items():
            if index == integer:
                return word
        return None

    # Generate a description for an image
    def generate_desc(model, tokenizer, photo, max_length):
        # Seed the generation process
        in_text = 'START'
        # Iterate over the whole length of the sequence
        for i in range(900):
            # Integer encode the input sequence
            sequence = tokenizer.texts_to_sequences([in_text])[0][-100:]
            # Pad the input
            sequence = pad_sequences([sequence], maxlen=max_length)
            # Predict the next word
            yhat = model.predict([photo, sequence], verbose=0)
            # Convert the probability to an integer
            yhat = np.argmax(yhat)
            # Map the integer to a word
            word = word_for_id(yhat, tokenizer)
            # Stop if we cannot map the word
            if word is None:
                break
            # Append as input for generating the next word
            in_text += ' ' + word
            # Print the prediction
            print(' ' + word, end='')
            # Stop if we predict the end of the sequence
            if word == 'END':
                break
        return

    # Load an image, preprocess it for IR2, extract features and generate the HTML
    test_image = img_to_array(load_img('images/87.jpg', target_size=(299, 299)))
    test_image = np.array(test_image, dtype=float)
    test_image = preprocess_input(test_image)
    test_features = IR2.predict(np.array([test_image]))
    generate_desc(model, tokenizer, np.array(test_features), 100)

Output

Links to generated websites

250 epochs
350 epochs
450 epochs
550 epochs

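If you don't see anything when you click these links, you can right-click and choose "View Page Source". Here is the original website for reference.

Mistakes I made:

- LSTMs are a lot heavier for my cognition than CNNs. When I unrolled all the LSTMs, they became easier to understand. Fast.ai's video on RNNs (http://course.fast.ai/lessons/lesson6.html) was very useful. Also, focus on the input and output features before you try to understand how they work.
- Building a vocabulary from the ground up is a lot easier than narrowing down a huge vocabulary. This includes everything from fonts, div sizes, and hex colors to variable names and normal words.
- Most of the libraries are created to parse text documents, not code. In documents, everything is separated by a space, but in code, you need custom parsing.
- You can extract features with a model that is trained on ImageNet. This might seem counterintuitive since ImageNet has few web images. However, the loss was 30% higher compared to a pix2code model trained from scratch. It would be interesting to use a pre-trained inception-resnet type of model based on web screenshots.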

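Bootstrap Version

In our final version, we'll use a dataset of generated Bootstrap websites from the pix2code paper. By using Twitter's Bootstrap, we can combine HTML and CSS and reduce the amount of code.

We'll enable it to generate the markup for a screenshot it has not seen before. We'll also dig into how it builds knowledge about the screenshot and the markup.

Instead of training it on the Bootstrap markup, we'll use 17 simplified tokens that we then translate into HTML and CSS. The dataset includes 1500 test screenshots and 250 validation images. For each screenshot there are on average 65 tokens, resulting in 96925 training examples.

By tweaking the model from the pix2code paper, the model can predict the web components with 97% accuracy (BLEU 4-ngram greedy search, more on this later).

An end-to-end approach

Extracting features from pre-trained models works well in image captioning models. But after a few experiments, I realized that pix2code's end-to-end approach works better for this problem. The pre-trained models have not been trained on web data and are customized for classification.

In this model, we replace the pre-trained image features with a light convolutional neural network. Instead of using max-pooling to increase information density, we increase the strides. This maintains the position and the color of the front-end elements.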
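To see how strides shrink the image without any pooling layer, the spatial size after each convolution can be computed by hand. The sketch below assumes a 256-pixel input and a stack of 3x3 convolutions in which every second layer uses a stride of 2, mirroring the layer list in the model code for this version; the size formulas are the usual ones for "valid" and "same" padding.

```python
import math


def conv_output_size(size, kernel, stride, padding):
    """Spatial output size of a convolution along one dimension."""
    if padding == "same":
        return math.ceil(size / stride)
    if padding == "valid":
        return (size - kernel) // stride + 1
    raise ValueError("unknown padding: " + padding)


# (kernel, stride, padding) for each convolution, assumed from the model code
layers = [(3, 1, "valid"), (3, 2, "same"), (3, 1, "same"), (3, 2, "same"),
          (3, 1, "same"), (3, 2, "same"), (3, 1, "same")]

size = 256
sizes = []
for kernel, stride, padding in layers:
    size = conv_output_size(size, kernel, stride, padding)
    sizes.append(size)
print(sizes)
```

Each stride-2 layer roughly halves the width and height, so the network ends with a 32 by 32 grid that still preserves where each element sits on the page.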

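There are two core models that enable this: convolutional neural networks (CNN) and recurrent neural networks (RNN). The most common recurrent neural network is long short-term memory (LSTM), so that is what I'll refer to.

There are plenty of great CNN tutorials, and I covered them in my previous article. Here, I'll focus on the LSTMs.

Understanding timesteps in LSTMs

One of the harder things to grasp about LSTMs is timesteps. A vanilla neural network can be thought of as two timesteps. If you give it "Hello", it predicts "World". But it would struggle to predict more timesteps. In the example below, the input has four timesteps, one for each word.

LSTMs are made for input with timesteps. An LSTM is a neural network customized for information in order. If you unroll our model, it looks like this. For each downward step, you keep the same weights. You apply one set of weights to the previous output and another set to the new input.

The weighted input and output are concatenated and added together with an activation. This is the output for that timestep. Since we reuse the weights, they draw information from several inputs and build knowledge of the sequence.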
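The weight reuse across timesteps can be sketched with a plain recurrent step on single numbers. This is a simple RNN, not a full LSTM, and the weights w_x, w_h, and b are arbitrary illustrative values rather than trained ones.

```python
import math


def rnn_step(x, h_prev, w_x, w_h, b):
    """One timestep: weigh the new input and the previous output, then activate."""
    return math.tanh(w_x * x + w_h * h_prev + b)


def run_rnn(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Apply the same weights at every timestep and collect the outputs."""
    h = 0.0
    outputs = []
    for x in inputs:
        h = rnn_step(x, h, w_x, w_h, b)
        outputs.append(h)
    return outputs


outputs = run_rnn([1.0, 0.0, 0.0, 0.0])
print([round(h, 3) for h in outputs])
```

Even though only the first input is non-zero, the later outputs stay above zero: the previous output carries information from the first timestep forward.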

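Here is a simplified version of the process for each timestep in an LSTM.

To get a feel for this logic, I'd recommend building an RNN from scratch with Andrew Trask's brilliant tutorial.

Understanding the units in LSTM layers

The number of units in each LSTM layer determines its ability to memorize. This also corresponds to the size of each output feature. Again, a feature is a long list of numbers used to transfer information between layers.

Each unit in the LSTM layer learns to keep track of different aspects of the syntax. Below is a visualization of a unit that keeps track of the information in the row div. This is the simplified markup we are using to train the Bootstrap model.

Each LSTM unit maintains a cell state. Think of the cell state as the memory. The weights and activations are used to modify the state in different ways. This enables the LSTM layers to fine-tune which information to keep and which to discard for each input.

In addition to passing through an output feature for each input, it also forwards the cell states, one value for each unit in the LSTM. To get a feel for how the components within the LSTM interact, I recommend Colah's tutorial, Jayasiri's Numpy implementation, and Karpathy's lecture and write-up.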
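As a rough companion to those resources, here is a single LSTM unit written out on scalar values, following the standard gate equations (forget, input, candidate, output). All weights are made-up illustrative numbers, and a real layer works on vectors with one cell state per unit.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, w):
    """One timestep of one LSTM unit; w maps gate name -> (w_x, w_h, bias)."""
    def gate(name, activation):
        w_x, w_h, b = w[name]
        return activation(w_x * x + w_h * h_prev + b)

    f = gate("forget", sigmoid)        # how much of the old memory to keep
    i = gate("input", sigmoid)         # how much new information to let in
    g = gate("candidate", math.tanh)   # the new information itself
    o = gate("output", sigmoid)        # how much of the memory to expose
    c = f * c_prev + i * g             # updated cell state (the memory)
    h = o * math.tanh(c)               # output feature for this timestep
    return h, c


weights = {name: (0.5, 0.5, 0.0) for name in ("forget", "input", "candidate", "output")}
h, c = 0.0, 0.0
for x in [1.0, 0.0, 0.0]:
    h, c = lstm_step(x, h, c, weights)
    print(round(h, 4), round(c, 4))
```

The cell state c is what gets forwarded alongside the output h, which is the "memory" the text refers to.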

    from os import listdir
    import numpy as np
    from keras.preprocessing.text import Tokenizer
    from keras.preprocessing.sequence import pad_sequences
    from keras.models import Model, Sequential
    from keras.utils import to_categorical
    from keras.optimizers import RMSprop
    from keras.callbacks import ModelCheckpoint
    from keras.layers import Embedding, RepeatVector, LSTM, concatenate, Input, Dense, Dropout, Flatten, Conv2D

    dir_name = 'resources/eval_light/'

    # Read a file and return a string
    def load_doc(filename):
        with open(filename, 'r') as file:
            return file.read()

    def load_data(data_dir):
        text = []
        images = []
        # Load all the files and order them
        all_filenames = listdir(data_dir)
        all_filenames.sort()
        for filename in all_filenames:
            if filename[-3:] == "npz":
                # Load the images already prepared in arrays
                image = np.load(data_dir+filename)
                images.append(image['features'])
            else:
                # Load the bootstrap tokens and wrap them in a start and end tag
                syntax = '<START> ' + load_doc(data_dir+filename) + ' <END>'
                # Separate all the words with a single space
                syntax = ' '.join(syntax.split())
                # Add a space before each comma
                syntax = syntax.replace(',', ' ,')
                text.append(syntax)
        images = np.array(images, dtype=float)
        return images, text

    train_features, texts = load_data(dir_name)

    # Initialize the function to create the vocabulary
    tokenizer = Tokenizer(filters='', split=" ", lower=False)
    # Create the vocabulary
    tokenizer.fit_on_texts([load_doc('bootstrap.vocab')])

    # Add one spot for the empty word in the vocabulary
    vocab_size = len(tokenizer.word_index) + 1
    # Map the input sentences into the vocabulary indexes
    train_sequences = tokenizer.texts_to_sequences(texts)
    # The longest set of bootstrap tokens
    max_sequence = max(len(s) for s in train_sequences)
    # Specify how many tokens to have in each input sentence
    max_length = 48

    def preprocess_data(sequences, features):
        X, y, image_data = list(), list(), list()
        for img_no, seq in enumerate(sequences):
            for i in range(1, len(seq)):
                # Add the sentence until the current count (i) and add the current count to the output
                in_seq, out_seq = seq[:i], seq[i]
                # Pad all the input token sentences to max_sequence
                in_seq = pad_sequences([in_seq], maxlen=max_sequence)[0]
                # Turn the output into one-hot encoding
                out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
                # Add the corresponding image to the bootstrap token file
                image_data.append(features[img_no])
                # Cap the input sentence to 48 tokens and add it
                X.append(in_seq[-48:])
                y.append(out_seq)
        return np.array(X), np.array(y), np.array(image_data)

    X, y, image_data = preprocess_data(train_sequences, train_features)

    # Create the encoder
    image_model = Sequential()
    image_model.add(Conv2D(16, (3, 3), padding='valid', activation='relu', input_shape=(256, 256, 3,)))
    image_model.add(Conv2D(16, (3, 3), activation='relu', padding='same', strides=2))
    image_model.add(Conv2D(32, (3, 3), activation='relu', padding='same'))
    image_model.add(Conv2D(32, (3, 3), activation='relu', padding='same', strides=2))
    image_model.add(Conv2D(64, (3, 3), activation='relu', padding='same'))
    image_model.add(Conv2D(64, (3, 3), activation='relu', padding='same', strides=2))
    image_model.add(Conv2D(128, (3, 3), activation='relu', padding='same'))

    image_model.add(Flatten())
    image_model.add(Dense(1024, activation='relu'))
    image_model.add(Dropout(0.3))
    image_model.add(Dense(1024, activation='relu'))
    image_model.add(Dropout(0.3))

    image_model.add(RepeatVector(max_length))

    visual_input = Input(shape=(256, 256, 3,))
    encoded_image = image_model(visual_input)

    language_input = Input(shape=(max_length,))
    language_model = Embedding(vocab_size, 50, input_length=max_length, mask_zero=True)(language_input)
    language_model = LSTM(128, return_sequences=True)(language_model)
    language_model = LSTM(128, return_sequences=True)(language_model)

    # Create the decoder
    decoder = concatenate([encoded_image, language_model])
    decoder = LSTM(512, return_sequences=True)(decoder)
    decoder = LSTM(512, return_sequences=False)(decoder)
    decoder = Dense(vocab_size, activation='softmax')(decoder)

    # Compile the model
    model = Model(inputs=[visual_input, language_input], outputs=decoder)
    optimizer = RMSprop(lr=0.0001, clipvalue=1.0)
    model.compile(loss='categorical_crossentropy', optimizer=optimizer)

    # Save the model for every 2nd epoch
    filepath = "org-weights-epoch-{epoch:04d}--val_loss-{val_loss:.4f}--loss-{loss:.4f}.hdf5"
    checkpoint = ModelCheckpoint(filepath, monitor='val_loss', verbose=1, save_weights_only=True, period=2)
    callbacks_list = [checkpoint]

    # Train the model
    model.fit([image_data, X], y, batch_size=64, shuffle=False, validation_split=0.1, callbacks=callbacks_list, verbose=1, epochs=50)

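Test accuracy

It is tricky to find a fair way to measure the accuracy. Say you compare word by word. If your prediction is one word out of sync, you might have 0% accuracy. If you remove one word, which syncs the prediction, you might end up with 99/100.

I used the BLEU score, which is best practice in machine translation and image captioning models. It breaks the sentence into four n-grams, from 1-word to 4-word sequences. In the prediction below, "cat" is supposed to be "code".

To get the final score, you multiply each score by 25%: (4/5) × 0.25 + (2/4) × 0.25 + (1/3) × 0.25 + (0/2) × 0.25 = 0.2 + 0.125 + 0.083 + 0 = 0.408. The sum is then multiplied by a sentence length penalty. Since the length is correct in our example, it becomes our final score.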
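The arithmetic above can be checked with a short script. The five-token sentence pair is an assumption chosen so that the n-gram fractions come out as 4/5, 2/4, 1/3, and 0/2; the original figure is not reproduced here. Note that this follows the simplified weighted sum used in the text, whereas library implementations of BLEU such as NLTK's corpus_bleu combine the n-gram precisions with a geometric mean.

```python
from collections import Counter


def ngrams(tokens, n):
    """All consecutive n-token sequences in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_precision(reference, candidate, n):
    """Share of candidate n-grams that also appear in the reference (clipped)."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


def simplified_score(reference, candidate, max_n=4):
    """Equal-weight sum of the 1- to max_n-gram precisions, as in the text."""
    return sum(ngram_precision(reference, candidate, n) / max_n
               for n in range(1, max_n + 1))


reference = "start I can code end".split()   # assumed example sentence
candidate = "start I can cat end".split()    # "cat" should have been "code"
precisions = [ngram_precision(reference, candidate, n) for n in range(1, 5)]
score = simplified_score(reference, candidate)
print(precisions, round(score, 3))
```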

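You could increase the number of n-grams to make it harder. A four n-gram model is the one that best corresponds to human translations. I'd recommend running a few examples with the code below and reading the wiki page (https://en.wikipedia.org/wiki/BLEU).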

    import os
    import numpy as np
    from tqdm import tqdm
    from keras.models import model_from_json
    from keras.preprocessing.text import Tokenizer
    from keras.preprocessing.sequence import pad_sequences
    from nltk.translate.bleu_score import corpus_bleu
    # DSL-to-HTML compiler shipped with the project repository (import path assumed)
    from compiler.classes.Compiler import Compiler

    # Create a function to read a file and return its content
    def load_doc(filename):
        with open(filename, 'r') as file:
            return file.read()

    def load_data(data_dir):
        text = []
        images = []
        files_in_folder = os.listdir(data_dir)
        files_in_folder.sort()
        for filename in tqdm(files_in_folder):
            # Add an image
            if filename[-3:] == "npz":
                image = np.load(data_dir+filename)
                images.append(image['features'])
            else:
                # Add text and wrap it in a start and end tag
                syntax = '<START> ' + load_doc(data_dir+filename) + ' <END>'
                # Separate each word with a space
                syntax = ' '.join(syntax.split())
                # Add a space before each comma
                syntax = syntax.replace(',', ' ,')
                text.append(syntax)
        images = np.array(images, dtype=float)
        return images, text

    # Initialize the function to create the vocabulary
    tokenizer = Tokenizer(filters='', split=" ", lower=False)
    # Create the vocabulary in a specific order
    tokenizer.fit_on_texts([load_doc('bootstrap.vocab')])

    dir_name = '../../../../eval/'
    train_features, texts = load_data(dir_name)

    # Load model and weights
    with open('../../../../model.json', 'r') as json_file:
        loaded_model_json = json_file.read()
    loaded_model = model_from_json(loaded_model_json)
    # Load weights into new model
    loaded_model.load_weights("../../../../weights.hdf5")
    print("Loaded model from disk")

    # Map an integer to a word
    def word_for_id(integer, tokenizer):
        for word, index in tokenizer.word_index.items():
            if index == integer:
                return word
        return None

    print(word_for_id(17, tokenizer))

    # Generate a description for an image
    def generate_desc(model, tokenizer, photo, max_length):
        photo = np.array([photo])
        # Seed the generation process
        in_text = '<START> '
        # Iterate over the whole length of the sequence
        print('\nPrediction---->\n\n<START> ', end='')
        for i in range(150):
            # Integer encode the input sequence
            sequence = tokenizer.texts_to_sequences([in_text])[0]
            # Pad the input
            sequence = pad_sequences([sequence], maxlen=max_length)
            # Predict the next word
            yhat = model.predict([photo, sequence], verbose=0)
            # Convert the probability to an integer
            yhat = np.argmax(yhat)
            # Map the integer to a word
            word = word_for_id(yhat, tokenizer)
            # Stop if we cannot map the word
            if word is None:
                break
            # Append as input for generating the next word
            in_text += word + ' '
            print(word + ' ', end='')
            # Stop if we predict the end of the sequence
            if word == '<END>':
                break
        return in_text

    max_length = 48

    # Evaluate the skill of the model
    def evaluate_model(model, descriptions, photos, tokenizer, max_length):
        actual, predicted = list(), list()
        # Step over the whole set
        for i in range(len(descriptions)):
            yhat = generate_desc(model, tokenizer, photos[i], max_length)
            # Store actual and predicted
            print('\n\nReal---->\n\n' + descriptions[i])
            actual.append([descriptions[i].split()])
            predicted.append(yhat.split())
        # Calculate the BLEU score
        bleu = corpus_bleu(actual, predicted)
        return bleu, actual, predicted

    bleu, actual, predicted = evaluate_model(loaded_model, texts, train_features, tokenizer, max_length)

    # Compile the tokens into HTML and CSS
    dsl_path = "compiler/assets/web-dsl-mapping.json"
    compiler = Compiler(dsl_path)
    compiled_website = compiler.compile(predicted[0], 'index.html')

    print(compiled_website)
    print(bleu)

Output

Links to sample output

Generated website 1 - Original 1
Generated website 2 - Original 2
Generated website 3 - Original 3
Generated website 4 - Original 4
Generated website 5 - Original 5

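Mistakes I made:

- Understand the weakness of the models instead of testing random models. At first, I applied random things such as batch normalization and bidirectional networks, and tried implementing attention. After looking at the test data and seeing that it could not predict color and position with high accuracy, I realized there was a weakness in the CNN. This led me to replace max-pooling with increased strides. The validation loss went from 0.12 to 0.02, and the BLEU score increased from 85% to 97%.
- Only use pre-trained models if they are relevant. Given the small dataset, I thought that a pre-trained image model would improve the performance. From my experiments, an end-to-end model is slower to train and requires more memory, but is 30% more accurate.
- Plan for slight variance when you run your model on a remote server. On my Mac, it read the files in alphabetical order. On the server, however, they were read in random order. This created a mismatch between the screenshots and the code. It still converged, but the validation data was 50% worse than after I fixed it.
- Make sure you understand library functions. Include space for the empty token in your vocabulary. When I didn't add it, it did not include one of the tokens. I only noticed it after looking at the final output several times and noticing that it never predicted a "single" token. After a quick check, I realized it wasn't even in the vocabulary. Also, use the same order in the vocabulary for training and testing.
- Use lighter models when experimenting. Using GRUs instead of LSTMs reduced each epoch cycle by 30% and did not have a large effect on the performance.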
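The file-ordering mistake above is easy to guard against, because os.listdir returns entries in arbitrary order. One way to make the pairing explicit is to match files by their base name instead of relying on list position. This is a sketch under assumptions: the helper pair_files and the ".gui" extension for the markup files are hypothetical, while ".npz" follows the image arrays used in the code above.

```python
import os


def pair_files(filenames, image_ext=".npz", markup_ext=".gui"):
    """Pair each screenshot with its markup file by shared base name."""
    by_stem = {}
    for name in sorted(filenames):
        stem, extension = os.path.splitext(name)
        by_stem.setdefault(stem, {})[extension] = name
    # Sorting by stem gives the same order on every machine.
    return [(files[image_ext], files[markup_ext])
            for stem, files in sorted(by_stem.items())
            if image_ext in files and markup_ext in files]


# A listing in arbitrary order, as os.listdir may return it
listing = ["b.gui", "a.npz", "b.npz", "a.gui"]
pairs = pair_files(listing)
print(pairs)
```

Files without a partner are skipped, so a missing screenshot cannot shift every later pair by one.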

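Next Steps

Front-end development is an ideal space to apply deep learning. It is easy to generate data, and the current deep learning algorithms can map most of the logic.

One of the most exciting areas is applying attention to LSTMs. This will not just improve the accuracy, but also enable us to visualize where the CNN puts its focus as it generates the markup.

Attention is also key for communicating between markup, stylesheets, scripts, and eventually the backend. Attention layers can keep track of variables, enabling the network to communicate between programming languages.

But in the near future, the biggest impact will come from building a scalable way to synthesize data. Then you can add fonts, colors, words, and animations step by step.

So far, most progress is happening in taking sketches and turning them into template apps. In less than two years, we'll be able to draw an app on paper and have the corresponding front-end in less than a second. There are already two working prototypes, built by Airbnb's design team and Uizard.

Here are some experiments to get started.

Experiments

Getting started

- Run all the models
- Try different hyperparameters
- Test a different CNN architecture
- Add bidirectional LSTM models
- Implement the model with a different dataset. (You can easily mount this dataset in your FloydHub jobs with this flag: --data emilwallner/datasets/100k-html:data)

Further experiments

- Create a solid random app/web generator with the corresponding syntax.
- Data for a sketch-to-app model. Auto-convert the app/web screenshots into sketches and use a GAN to create variety.
- Apply an attention layer to visualize the focus on the image for each prediction, similar to this model.
- Create a framework for a modular approach. Say, having one encoder model for fonts, one for color, and another for layout, and combining them with one decoder. Solid image features are a good start.
- Feed the network simple HTML components and teach it to generate animations using CSS. It would be fascinating to have an attention approach and visualize the focus on both input sources.

A huge thanks to Tony Beltramelli and Jon Gold for their guidance, their research, and all their ideas. Thanks to Jason Brownlee for his stellar Keras tutorials (I included a few snippets from his tutorial in the core Keras implementation), and to Beltramelli for providing the data. Also thanks to Qingping Hou, Charlie Harrington, Sai Soundararaj, Jannes Klaas, Claudio Cabral, Alain Demenet, and Dylan Djian for reading drafts of this.

About the author, Emil Wallner

This is the fourth part of a multi-part blog series from Emil as he learns deep learning. Emil has spent a decade exploring human learning. He has worked for Oxford's business school, invested in education startups, and built an education technology business. Last year, he enrolled at Ecole 42 to apply his knowledge of human learning to machine learning.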
