Best database design for linking related tags

Here is my current table structure for tags:

// tags
+----+------------+----------------------------------+----------+------------+
| id |    name    |            description           | used_num | date_time  |
+----+------------+----------------------------------+----------+------------+
| 1  | PHP        | some explanations for PHP        | 4234     | 1475028896 |
| 2  | SQL        | some explanations for SQL        | 734      | 1475048601 |
| 3  | jQuery     | some explanations for jQuery     | 434      | 1475068321 | 
| 4  | MySQL      | some explanations for MySQL      | 535      | 1475068332 |
| 5  | JavaScript | some explanations for JavaScript | 3325     | 1475071430 |
| 6  | HTML       | some explanations for HTML       | 2133     | 1475077842 |
| 7  | postgresql | some explanations for postgresql | 43       | 1475077851 |
+----+------------+----------------------------------+----------+------------+

As you know, some tags are related to each other. For example:

SQL, MySQL, postgresql

JavaScript, jQuery

are related tags in the table above. How can I model that relation between them? Should I add one more column? If so, what should it contain (given that sometimes more than two tags are related)?

2 solutions

#1

For Option 1 at least, please see Edit 1 at bottom for the actual INSERT strategy for the intersect.

Option 1

create table tags
(   id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(60) NOT NULL,
    UNIQUE KEY `key_tags_name` (name)
    -- All the other columns
)ENGINE=InnoDB;

create table tagIntersects
(   id1 INT NOT NULL,
    id2 INT NOT NULL,
    PRIMARY KEY(id1,id2),
    KEY `ti_flipped` (id2,id1), -- flipped so id2 is left-most (thin size)
    FOREIGN KEY `fk_ti_id1` (id1) REFERENCES tags(id),
    FOREIGN KEY `fk_ti_id2` (id2) REFERENCES tags(id)
)ENGINE=InnoDB;

Option 2

create table tags
(   id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(60) NOT NULL,
    UNIQUE KEY `key_tags_name` (name)
    -- All the other columns
)ENGINE=InnoDB;

create table tagIntersects
(   id INT AUTO_INCREMENT PRIMARY KEY,
    name1 VARCHAR(60) NOT NULL,
    name2 VARCHAR(60) NOT NULL,
    KEY `ti_Flipped` (name2,name1), -- these get costly (wide)
    FOREIGN KEY `fk_ti_id1` (name1) REFERENCES tags(name),
    FOREIGN KEY `fk_ti_id2` (name2) REFERENCES tags(name)
)ENGINE=InnoDB;

Option 3

create table tags
(   id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(60) NOT NULL,
    UNIQUE KEY `key_tags_name` (name)
    -- All the other columns
)ENGINE=InnoDB;

create table tagIntersects
(   name1 VARCHAR(60) NOT NULL,
    name2 VARCHAR(60) NOT NULL,
    PRIMARY KEY (name1,name2),
    KEY `ti_Flipped` (name2,name1), -- these get costly (wide)
    FOREIGN KEY `fk_ti_id1` (name1) REFERENCES tags(name),
    FOREIGN KEY `fk_ti_id2` (name2) REFERENCES tags(name)
)ENGINE=InnoDB;

INSERT tags (name) VALUES ('PHP'),('PDO'),('MYSQLI'),('PHPMyAdmin');

I recommend Option 1. Everything below is about Option 1.

Fake load 200 tags with random names:

DROP PROCEDURE IF EXISTS tagDataLoad;
DELIMITER $$
CREATE PROCEDURE tagDataLoad()
BEGIN
    -- warning this is horribly slow
    DECLARE i INT DEFAULT 0;
    WHILE i<200 DO
        INSERT IGNORE tags(name)
        select concat(substring('ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789', rand()*36+1, 1),
              substring('ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789', rand()*36+1, 1),
              substring('ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789', rand()*36+1, 1),
              substring('ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789', rand()*36+1, 1),
              substring('ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789', rand()*36+1, 1),
              substring('ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789', rand()*36+1, 1),
              substring('ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789', rand()*36+1, 1),
              substring('ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789', rand()*36+1, 1)
             );
        -- poach the above from Gordon, link: https://stackoverflow.com/a/16738136
        SET i=i+1;
    END WHILE;
END$$
DELIMITER ;

call it:

CALL tagDataLoad(); -- load 200 fake tags
select * from tags; -- eyeball it
CALL tagDataLoad(); -- load more
CALL tagDataLoad(); -- load more
SELECT MIN(id),MAX(id),count(*) FROM tags;
-- 1 604 604

Fake load iCount tag intersects:

DROP PROCEDURE IF EXISTS tagIntersectDataLoad;
DELIMITER $$
CREATE PROCEDURE tagIntersectDataLoad(iCount INT)
BEGIN
    -- warning this is horribly slow
    -- don't pass a number greater than 100 until you time it
    DECLARE i INT DEFAULT 0;
    WHILE i<iCount DO -- attempt exactly iCount inserts (i<=iCount would attempt one extra)
        INSERT IGNORE tagIntersects(id1,id2)
        SELECT FLOOR(RAND()*600)+1,FLOOR(RAND()*600)+1;
        SET i=i+1;
    END WHILE;
END$$
DELIMITER ;

CALL tagIntersectDataLoad(100);
-- slow, I don't recommend a number above 100 until you time it

After changing the 100 to larger counts across several calls, I ended up with roughly 10k rows:

select count(*) from tagIntersects;
-- 9900

I don't recommend doing that, because of timeouts, but in the end I had the row count above.

Half the reason for the fake-load stored procs above is just to get the table size high enough that indexes are even used; they aren't used for small tables. The procs also give you a method to change to other schemas and load data tailored to them, and then to profile the performance of your queries with 100k rows or tens of millions (depending on your needs).

See EXPLAIN plan:

explain 
select * from tagIntersects where id1=111 or id2=111;
+----+-------------+---------------+-------------+--------------------+--------------------+---------+------+------+----------------------------------------------+
| id | select_type | table         | type        | possible_keys      | key                | key_len | ref  | rows | Extra                                        |
+----+-------------+---------------+-------------+--------------------+--------------------+---------+------+------+----------------------------------------------+
|  1 | SIMPLE      | tagIntersects | index_merge | PRIMARY,ti_flipped | PRIMARY,ti_flipped | 4,4     | NULL |   27 | Using union(PRIMARY,ti_flipped); Using where |
+----+-------------+---------------+-------------+--------------------+--------------------+---------+------+------+----------------------------------------------+

explain
select * from tagIntersects where (id1=111 or id2=111) and (id1=500 or id2=500);
+----+-------------+---------------+-------------+--------------------+--------------------+---------+------+------+----------------------------------------------+
| id | select_type | table         | type        | possible_keys      | key                | key_len | ref  | rows | Extra                                        |
+----+-------------+---------------+-------------+--------------------+--------------------+---------+------+------+----------------------------------------------+
|  1 | SIMPLE      | tagIntersects | index_merge | PRIMARY,ti_flipped | PRIMARY,ti_flipped | 4,4     | NULL |   21 | Using union(PRIMARY,ti_flipped); Using where |
+----+-------------+---------------+-------------+--------------------+--------------------+---------+------+------+----------------------------------------------+

The EXPLAIN plan above looks good for typical queries. Note the thin key sizes (8 bytes total). The plan shows an index_merge / UNION of two left-most key usages: one with the PK and one with the flipped secondary index. That is the point of ti_flipped.

Also note that the FK key sizes are thin.

Note that tags.name can be readily updated from 'node.js' to 'nodejs' with no impact on the tags Primary Key. And that update would have zero impact on tagIntersects columns or keys.
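
As a minimal sketch of that rename under Option 1: the tag 'node.js' is hypothetical here (it is not in the sample data above), and the follow-up SELECT is only there to show that the intersect rows still resolve through the unchanged id.

```sql
-- Assumes a tag named 'node.js' exists (hypothetical; not in the sample data above).
-- Only one row in tags is written; tagIntersects is not touched.
UPDATE tags
SET    name = 'nodejs'
WHERE  name = 'node.js';

-- The intersect rows are keyed by id, so they still point at the renamed tag.
SELECT ti.id1, ti.id2
FROM   tagIntersects ti
JOIN   tags t ON t.id IN (ti.id1, ti.id2)
WHERE  t.name = 'nodejs';
```

The UPDATE would be rejected by the unique key on name if a 'nodejs' tag already existed, which is the behaviour you would want in that case.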

Concerning Options 2 and 3: the keys are wide. Changes to tags.name would ripple into FK (and, in Option 3, PK) changes in the intersect table, an impact that Option 1 does not endure. Depending on your data size (say, something other than SO tags), with tens of millions of rows and thousands of intersects for one name, that change could be felt in the UX. For small to medium sizes there is not much to worry about, but profile the impact.

So generally I opt for an Option 1 approach, due to the enormity of my data sets and a wish to keep the keys for relationships thin.

Spencer7593 mentioned in a comment recently that large varchars have a negative impact on internal memory during joins, and also on subqueries / derived tables that manifest as temporary tables. Another reason for thin FKs.

So this applies just as much to readers in general who have different schemas and assume those choices have no impact on performance.

So profile the performance of your queries before you settle on a schema (especially for huge tables).

Edit 1

create table tags
(   id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(60) NOT NULL,
    UNIQUE KEY `key_tags_name` (name)
    -- All the other columns
)ENGINE=InnoDB;

create table tagIntersects
(   id1 INT NOT NULL,
    id2 INT NOT NULL,
    PRIMARY KEY(id1,id2),
    KEY `ti_flipped` (id2,id1),
    FOREIGN KEY `fk_ti_id1` (id1) REFERENCES tags(id),
    FOREIGN KEY `fk_ti_id2` (id2) REFERENCES tags(id)
)ENGINE=InnoDB;

load:

insert tags(name) values ('tag1'),('tag2'); -- ids 1 and 2

Now say I have two ids in some programming language and I want to intersect the tags.

Using MySQL as a programming language, let's just call them the following vars:

set @myId1=2; -- actually run this so it is not NULL
set @myId2=1; -- actually run this so it is not NULL

Note that they could be reversed; it does not matter. Now, assuming you did not slip up in the programming language such that @myId1 = @myId2 (the statements below would still run in that case, but they would store a tag paired with itself):

insert tagIntersects(id1,id2) select LEAST(@myId1,@myId2),GREATEST(@myId1,@myId2); -- ok
insert tagIntersects(id1,id2) select LEAST(@myId1,@myId2),GREATEST(@myId1,@myId2); -- GOOD it failed

-- flip em:

set @myId1=1; -- actually run this so it is not NULL
set @myId2=2; -- actually run this so it is not NULL

insert tagIntersects(id1,id2) select LEAST(@myId1,@myId2),GREATEST(@myId1,@myId2); -- GOOD it failed

Your data stays clean. Clean means you would not have two rows in there that dupe up and pollute your data, such as an intersect row for MYSQL / SQL-SERVER and another row for SQL-SERVER / MYSQL in id1, id2 respectively.
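
If you would rather not rely on the caller to rule out @myId1 = @myId2, or to handle the duplicate-key error, one possible variant is sketched below. The WHERE guard and the use of INSERT IGNORE are my additions for illustration, not part of the strategy above.

```sql
SET @myId1 = 2;
SET @myId2 = 1;

-- Same LEAST/GREATEST normalization as above, with two extra guards:
--   WHERE ... <> ...  skips a tag paired with itself (and NULL variables),
--   IGNORE            turns a duplicate-key error into a warning.
INSERT IGNORE tagIntersects (id1, id2)
SELECT LEAST(@myId1, @myId2), GREATEST(@myId1, @myId2)
FROM   DUAL
WHERE  @myId1 <> @myId2;

-- Inspect why a row was skipped, if it was.
SHOW WARNINGS;
```

The trade-off is that IGNORE hides the failure from the caller, so use the plain INSERT if the application needs to know that the pair already existed.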

Edit 2

Question from user Shafizadeh: Ok, you have three tags, tag1, tag2, tag3, and they are related. So there are three rows in the tagIntersects table, like these: (tag1,tag2), (tag1,tag3), (tag2,tag3). Right? Now I want to select all tags related to tag1. Write the query ... :-) seems like a nightmare, huh?

Answer:

explain select * from tagIntersects where id1=2 or id2=2; 
+----+-------------+---------------+-------------+--------------------+--------------------+---------+------+------+----------------------------------------------+
| id | select_type | table         | type        | possible_keys      | key                | key_len | ref  | rows | Extra                                        |
+----+-------------+---------------+-------------+--------------------+--------------------+---------+------+------+----------------------------------------------+
|  1 | SIMPLE      | tagIntersects | index_merge | PRIMARY,ti_flipped | PRIMARY,ti_flipped | 4,4     | NULL |   37 | Using union(PRIMARY,ti_flipped); Using where |
+----+-------------+---------------+-------------+--------------------+--------------------+---------+------+------+----------------------------------------------+

My question back to you is: with your CSV approach, what is your EXPLAIN plan? It would look awful, and it would be a table scan.
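
For completeness, here is a sketch of how the three-tag case from the question could be loaded and read back with names instead of ids. It assumes the Option 1 schema from Edit 1 and a 'tag3' row, which I add here for illustration; the aliases (a, b, me, t) are arbitrary.

```sql
-- Hypothetical third tag, so that tag1, tag2 and tag3 all exist.
INSERT IGNORE tags (name) VALUES ('tag3');

-- Relate the three tags: every unordered pair once, smaller id in id1.
INSERT IGNORE tagIntersects (id1, id2)
SELECT a.id, b.id
FROM   tags a
JOIN   tags b ON a.id < b.id
WHERE  a.name IN ('tag1', 'tag2', 'tag3')
AND    b.name IN ('tag1', 'tag2', 'tag3');

-- All tags related to 'tag1'.
-- First branch: tag1 sits in id1 (left-most column of the PK).
-- Second branch: tag1 sits in id2 (left-most column of ti_flipped).
SELECT t.id, t.name
FROM   tags me
JOIN   tagIntersects ti ON ti.id1 = me.id
JOIN   tags t ON t.id = ti.id2
WHERE  me.name = 'tag1'
UNION ALL
SELECT t.id, t.name
FROM   tags me
JOIN   tagIntersects ti ON ti.id2 = me.id
JOIN   tags t ON t.id = ti.id1
WHERE  me.name = 'tag1';
```

UNION ALL is safe here only because each pair is stored once with the smaller id first and no tag is paired with itself; if your data does not guarantee that, use UNION instead. Run EXPLAIN on it against your own data before relying on the index usage described in the comments.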
