What is the best way to analyze sets of tags?

I have thousands of survey responses that have been tagged according to the content of the response. Each response can have one tag or many (up to 20), and the tags are independent of one another rather than being structured into category-subcategory or something.

I want to be able to do analysis like the following:

How many instances of a given tag are there?

Which tags occur most frequently overall?

Where tag X is present, which other tags appear along with it most frequently?

List of all tags with the count of each next to it

Select subsets of the data to do similar analysis on (by country, for example)

The people I'm working with have traditionally tackled everything in Excel (general business strategy consulting work), and that won't work in this case. Their response is to change the project framework to something that Excel can handle in a pivot table, but it would be so much better if we could use more robust tools that allow for more sophisticated relationships.

I've been learning SQLite but am starting to fear that the kinds of things I want to do will be pretty complicated.

I've also been learning Python (for unrelated reasons) and am kind of wondering if an ORM tool and some Python code might be the better way to go.

And then there's something like Access (which I don't have but would possibly be willing to get if it's a sweet spot for this kind of thing).

In summary, I'd love to know how hard these kinds of analysis would be to do overall and which tools would best be suited for the job. I'm completely open to the idea that I'm thinking about some of or all of the problem in a way that's backwards and would welcome any advice on any aspect of what I've written here.

4 solutions

#1

Collect all tags into a list and use Python's collections.Counter and its associated methods to get the frequencies and a host of other statistics, like this:

>>> from collections import Counter
>>> x = ['java', 'python', 'scheme', 'scheme', 'scheme', 'python', 'go', 'go', 'c', 'c']
>>> freqs = Counter(x)
>>> freqs.most_common(1)
[('scheme', 3)]
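
The same Counter approach can probably be stretched to cover the co-occurrence and per-country questions as well. The following is only a rough sketch: the sample data, the (country, tags) layout, and the helper names tag_counts, co_occurring and subset are all made up for illustration, so adapt them to however the responses are actually stored.

```python
from collections import Counter

# Hypothetical sample data: each response is a (country, set_of_tags) pair.
responses = [
    ("US", {"price", "support", "speed"}),
    ("US", {"price", "speed"}),
    ("DE", {"support", "price"}),
    ("DE", {"speed"}),
]


def tag_counts(rows):
    """Count how many responses carry each tag."""
    return Counter(tag for _, tags in rows for tag in tags)


def co_occurring(rows, target):
    """Count the tags that appear in the same response as `target`."""
    counts = Counter()
    for _, tags in rows:
        if target in tags:
            # Exclude the target itself so only its companions are counted.
            counts.update(tags - {target})
    return counts


def subset(rows, country):
    """Keep only the responses from one country (illustrative filter)."""
    return [row for row in rows if row[0] == country]


overall = tag_counts(responses)                 # every tag with its count
with_price = co_occurring(responses, "price")   # tags seen alongside "price"
us_only = tag_counts(subset(responses, "US"))   # same analysis on a subset

print(overall.most_common())
print(with_price.most_common())
print(us_only.most_common())
```

Because every result is a Counter, most_common(n) gives the top n tags for any of these views, and the full "tag plus count" listing is just a loop over most_common().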
