FACULTEIT ECONOMIE 
EN BEDRIJFSKUNDE 

 
TWEEKERKENSTRAAT 2 

B-9000 GENT 
Tel. : 32 -  (0)9 – 264.34.61 
Fax. : 32 -  (0)9 – 264.35.92 

 
WORKING PAPER 
 
 
Mining Ideas from Textual Information 
 

Dirk Thorleuchter1 

Dirk Van den Poel2 

Anita Prinzie3 
 
 
November 2009 
 

2009/619 
 
 
1  Fraunhofer INT, Appelsgarten 2, 53879 Euskirchen, Germany & PhD Candidate, Ghent 
University, dirk.thorleuchter@int.fraunhofer.de 
2  Prof. Dr. Dirk Van den Poel, Professor of Marketing Modeling/analytical Customer Relationship Management, 
Faculty of Economics and Business Administration, dirk.vandenpoel@ugent.be; more papers about customer 
relationship management can be obtained from the website: www.crm.UGent.be, more papers about text mining 
can be downloaded from www.textmining.UGent.be  
3 Prof. Dr. Anita Prinzie is visiting professor at Ghent University 

     D/2009/7012/71 



Mining Ideas from Textual Information 
 
Dirk Thorleuchter1, Dirk Van den Poel2, and Anita Prinzie2 
 
1Fraunhofer INT, Appelsgarten 2, 53879 Euskirchen, Germany 
2Ghent University, Faculty of Economics and Business Administration, Tweekerkenstraat 2, 9000 
Gent, Belgium 
 
Abstract  
This paper introduces idea mining as the process of extracting new and useful ideas from 
unstructured text. We use an idea definition from the philosophy of technology and we focus on 
ideas that can be used to solve technological problems. 
 
The rationale for the idea mining approach is taken from psychology and cognitive science 
and follows how people create ideas. To realize the processing, we use methods from text 
mining and text classification (tokenization, term filtering methods, Euclidean distance measure, 
etc.) and combine them with a new heuristic measure for mining ideas.  
 
As a result, the idea mining approach automatically extracts new and useful ideas from a 
user-provided text. We present these problem solution ideas in a comprehensible way to support 
users in problem solving. The approach is evaluated with patent data and is realized as a web-based 
application named 'Technological Idea Miner', which can be used for further testing and evaluation. 
 
Keywords 
Idea Mining, Text Mining, Text Classification, Technology 
 

Introduction 
 

Overview 

An idea is an image existing or formed in the mind, but it can be written down as textual 
information. In recent years, the amount of available information has increased continually. About 
80 % of all this information is stored in textual form [9]. Examples are research papers, articles 
in technical periodicals, reports, documents, web pages, etc. These texts possibly contain many new 
ideas. A new idea is often needed to discover unconventional approaches, e.g. to create a 
technological breakthrough. However, manually extracting new ideas from this mass of 
texts is time-consuming and costly. Therefore, it is useful to search for new problem solution 
ideas automatically. 
 
Text mining, or knowledge discovery from texts, generally refers to the process of extracting 
interesting information and knowledge from unstructured text [12]. Accordingly, we 
introduce idea mining as an automatic process of extracting new and useful ideas from 
unstructured text using text mining methods.  



Creating ideas is a well-known topic in psychology and cognitive science. There, 
we find many approaches dealing with how people create ideas, especially for problem solving. 
Therefore, in Sect. 2 we focus on a general process of creating problem solution ideas and use it 
as the rationale for the idea mining approach. 
 
In recent years, data and text mining techniques have been used to explore and analyze huge 
amounts of available textual data [4]. Idea mining uses known methods from these techniques and 
combines them with a new method to create text patterns and a new heuristic measure for mining 
ideas in order to realize the rationale. We present the processing of the idea mining approach in 
Sect. 3 and introduce the new idea mining measure in Sect. 4. 
 
A further task of idea mining is to present the extracted ideas in a comprehensible way to the user. 
Therefore, we focus on results of comprehensibility research and their relations to our task (see 
Sect. 5). Additionally, we provide an extensive evaluation to show the success of the idea mining 
approach and specifically the heuristic idea mining measure (see Sect. 6). 
 

Idea Definition 

We limit our approach to technological language for two reasons. Firstly, 
technological language is much more standardized than colloquial language [11,16]. 
Therefore, we get better results by analyzing technological texts with text mining approaches. 
Secondly, our idea definition is taken from the philosophy of technology [21]. There, an idea is 
defined as a combination of two things: a means and an appertaining purpose. An example of an 
idea is a transistor. A transistor is a semiconductor device. It can be used to amplify or switch 
electronic signals. Here, we have a means (a semiconductor device) and an appertaining purpose 
(to amplify or switch electronic signals).  
 
In general, we talk about a new idea if a known means is related to an unknown purpose or if a 
known purpose is related to an unknown means [1]. For example, a nanomagnet is a new idea because a 
nanomagnet is a miniaturized magnet that can also be used to amplify or switch electronic signals. 
Here we have an unknown means (a miniaturized magnet) appearing together with a known 
purpose. This new idea could be useful to people working in the field of electronic 
signals because nanomagnetic technology could possibly replace transistor technology in the future. 
 
Therefore, we define a new and probably useful idea as a text phrase. This text phrase consists of 
domain specific terms that occur together in textual information. These terms can be divided 
into two subsets. The first subset should represent a known means (or a known purpose) and the 
second subset should represent an unknown purpose (or an unknown means). Additionally, all 
terms in the first subset should occur together in a text phrase of the technological problem 
description. 
 
 


Rationale behind Idea Mining 
Creating ideas is a well-known topic that is related to creativity in psychology and cognitive 
science. One of the first descriptions of the creative process was published by Wallas [23]. His 
stage model explains creative insights and illuminations for finding a problem solution. This 
model consists of a four-stage process. In stage one, 'preparation', the problem is analyzed so that 
a person recognizes the problem's dimensions. Stage two, 'incubation / intimation', and 
stage three, 'illumination', transfer the problem from the conscious to the unconscious mind. The 
unconscious mind works on the problem continuously and probably finds a solution through creative 
insights and illuminations. This solution is transferred to the conscious mind, which means that 
after some time the person suddenly gets an idea that is new to that person and that probably 
solves the problem. In the last stage, 'verification', the idea is tested for novelty and usefulness. 
 
One of the best-known pragmatic approaches to practical creativity is brainstorming by 
Osborn [17]. The first step in brainstorming is to define the problem, e.g. by creating descriptions 
of the problem. Then, people generate new ideas using creativity methods such as idea association. 
The last step in the brainstorming process is to cluster the generated ideas and to evaluate them 
for novelty and usefulness. 
 
Besides this, there are several further approaches dealing with the creation of new ideas. We can 
learn from all these approaches that three steps are necessary for creating ideas. The first step is 
to focus on a problem, the second step is to generate some new ideas specific to this problem 
with creative methods, and the third step is to evaluate the generated ideas for novelty and 
usefulness concerning the problem. 
 
Referring to these approaches, we build an adequate rationale for the idea mining process. 
Idea mining therefore also consists of three steps. In the first step, we focus on the problem. 
Here, the user of our idea mining approach has to provide textual information that describes 
the specific problem (a problem description). In the second step, the user has to provide further 
textual information that is expected to contain new and useful ideas (a new text) that 
can probably solve the problem [20]. Ideas are contained in text phrases inside this new text as 
described in Sect. 1.2. Therefore, with an automatic process, we extract a very large number 
of overlapping text phrases from the new text. In the remainder of this paper, text phrases are 
called text patterns. In the third step, all extracted text patterns are evaluated for novelty and 
usefulness. This means they are compared to the problem description by using a specific idea 
mining measure. With this measure, text patterns can be classified as new and useful ideas. 
In summary, idea mining identifies new and useful ideas in three steps:  
 
1. Preparation of a problem description 
2. Extraction of text patterns from a new text and 
3. Evaluation of text patterns for novelty and usefulness concerning problem description. 
 

Idea Mining Process 
 



Fig. 1 shows the processing of the idea mining approach in different steps based on the rationale 
for the idea mining process (see Sect. 2).  
 

Figure 1: Processing of our idea mining approach in different steps: After tokenization and term filtering, text 
patterns are created and term vectors are built representing these text patterns. Term vectors from the new text are 
compared to term vectors from the problem description using the Euclidean distance measure. Then, term vectors 
from the new text are compared to their most similar term vectors from the problem description using the idea 
mining measure. As a result, we get term vectors from the new text that represent new and useful ideas. 
 
With tokenization [3], texts are separated into terms, with the word as term unit. The set of 
different terms in a text is reduced by using stop word filtering and stemming [12]. For this, a 
general list of stop words is used as well as the well-known Porter stemming algorithm [18]. 
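
The preprocessing described above can be sketched roughly as follows. This is a minimal illustration, not the implementation used in the application: the stop word list is a tiny hypothetical sample, and `crude_stem` is a deliberately simple stand-in for the Porter algorithm [18]; the function names are assumptions made for this example.

```python
import re

# Illustrative stop word list (an assumption; the paper uses a general list).
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "the", "to"}


def tokenize(text):
    """Split a text into lowercase word tokens (term unit: word)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def crude_stem(term):
    """Strip a few common suffixes. Stand-in only, NOT the Porter algorithm."""
    for suffix in ("ing", "es", "s"):
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term


def preprocess(text):
    """Tokenize, drop stop words, then stem the remaining terms."""
    return [crude_stem(t) for t in tokenize(text) if t not in STOP_WORDS]
```

In practice, a full Porter stemmer and a complete stop word list would replace the two placeholders.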
 
A problem related to the use of stemming is the identification of synonyms and homonyms. Synonyms 
are different words with identical or at least similar meanings. Homonyms are words with 
the same spelling but different meanings. Stemming cannot identify synonyms and homonyms 
because it does not use knowledge of the context of a term. In this idea 
mining approach, we do not identify synonyms and homonyms. This is because the approach 
always considers the context of a term by working on text patterns containing several 
co-occurring terms as described below. 
 
Here, we show how to create these text patterns automatically. Around each appearance of each 
term in the new text, we create a text pattern containing the selected term and all terms that 
occur in the left and right context of the selected term. To reduce the number of text patterns, we 
only create text patterns around non-stop words and around terms that occur both in the new text 
and in the problem description.  
 
One important decision is the length of a text pattern. Text patterns should not be too small, 
so that they contain all terms representing a new idea. Furthermore, text patterns should not be 
too large, so that only terms related to the new idea occur in them. For example, if we set the 
length of the text patterns to $l$, then a text pattern contains the selected term, $l$ terms from 
its left context and also $l$ terms from its right context. The cardinality of the set of stop word 
filtered and stemmed terms from this pattern is normally smaller than $2 \cdot l + 1$ because some 
terms are stop words, some terms occur twice and some terms have the same stem. 


 
In this paper, we do not use a constant length $l$ for all patterns but a variable length of text 
patterns based on a dynamic adaptation to the context. This is realized by a term weighting 
scheme based on the difference between stop words and non-stop words, because a stop word 
in a text pattern is less important than a non-stop word. If an author 
formulates an idea very briefly by joining catchwords together, then he normally does not use 
many stop words and the text pattern length can be small. If an author formulates an idea in a 
flowery style, i.e. not in a clear and simple way, then he normally 
uses more stop words and the text pattern length has to be larger. In the idea mining application, 
the value of the text pattern length $l$ and the percentage of the importance of stop words $u$ 
and of non-stop words $v$ can be provided by the user. 

 
To compute the variable length of a text pattern, we first define the term weighting scheme. 
 
Definition 1. Let (a text) $T = [w_1, .., w_n]$ be a list of terms (words) $w_i$ in order of 
appearance and let $n \in \mathbb{N}$ be the number of terms in $T$ and $i \in [1,..,n]$. Let 
$\Sigma = [\tilde{w}_1, .., \tilde{w}_m]$ be a set of domain specific stop terms [15] and let 
$m \in \mathbb{N}$ be the number of terms in $\Sigma$. Let the percentage $u \in \mathbb{N}$ 
be a term weighting coefficient for stop words. Let the percentage $v \in \mathbb{N}$ be a term 
weighting coefficient for non-stop words. Then, we define $f_g(w_i) \in \mathbb{N}$ as term 
weighting scheme:  

$$f_g(w_i) = \begin{cases} u & w_i \in \Sigma \\ v & w_i \notin \Sigma \end{cases} \qquad \forall i \in \{1,..,n\} \qquad (1)$$

 
We give an example. The text pattern 'components for frequency conversion of infrared 
lasers' is built around the word 'conversion'. It contains the word 'conversion' itself, three terms 
from its left context (components for frequency), and three terms from its right context (of 
infrared lasers). Here, we use a constant length $l = 3$ and a term weighting scheme with 
$u = v = 100$ %. This means the importance of a stop word is equal to the importance of a 
non-stop word. The next text pattern is an example of a variable length: 'In a 1st phase, known but 
so far not available materials and technologies such as layer systems and crystals'. This text 
pattern is built around the word 'technologies'. Here we use a constant length $l = 3$ and a term 
weighting scheme with $u = 10$ % and $v = 100$ %. As a result, this text pattern contains six terms 
from the right context and eleven terms from the left context of the term 'technologies'. In this 
example, the non-stop words are phase, materials, technologies, layer, systems, and crystals. We 
compute the number of terms from the left and right context as described below: 

 
Definition 2. Let $l \in \mathbb{N}$ be a constant length of text patterns. Let 
$l_i^{left} \in \mathbb{N}$ be the number of terms from the left context of a text pattern that is 
built around the term $w_i$. Let $l_i^{right} \in \mathbb{N}$ be the number of terms from the right 
context of a text pattern that is built around the term $w_i$. Then, we define $l_i^{left}$ and 
$l_i^{right}$ as:  

$$l_i^{right} = \min_j \left( \left( \sum_{k=1}^{j} f_g(w_{i+k}) \ge l \right) \lor (i + j = n) \right) \qquad \forall i \in \{1,..,n\} \qquad (2)$$

$$l_i^{left} = \min_j \left( \left( \sum_{k=1}^{j} f_g(w_{i-k}) \ge l \right) \lor (i - j = 1) \right) \qquad \forall i \in \{1,..,n\} \qquad (3)$$

 
After computing $l_i^{left}$ and $l_i^{right}$, we can build a text pattern $T_i$ around the term 
$w_i$ from the text $T = [w_1, .., w_n]$. 

$$T_i = [w_{i - l_i^{left}}, .., w_i, .., w_{i + l_i^{right}}] \qquad (4)$$
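
The variable-length context of Definitions 1 and 2 and Eq. (4) can be sketched as below. This is an illustrative reading under stated assumptions: positions are 0-based, the weights `u` and `v` are integer percentages (so the threshold is `100 * l`, which avoids floating point rounding), and the function names are hypothetical.

```python
def term_weight(term, stop_words, u, v):
    """Eq. (1): u for stop words, v for non-stop words (integer percentages)."""
    return u if term in stop_words else v


def context_length(terms, i, step, l, stop_words, u, v):
    """Eqs. (2)/(3): number of context terms taken in one direction.

    Walks from position i (0-based) with step +1 (right) or -1 (left) until
    the accumulated weight reaches l (i.e. 100 * l percent) or the text ends.
    """
    total, j = 0, 0
    pos = i + step
    while 0 <= pos < len(terms):
        j += 1
        total += term_weight(terms[pos], stop_words, u, v)
        if total >= 100 * l:
            break
        pos += step
    return j


def text_pattern(terms, i, l, stop_words, u, v):
    """Eq. (4): the pattern around terms[i] with its left and right context."""
    left = context_length(terms, i, -1, l, stop_words, u, v)
    right = context_length(terms, i, +1, l, stop_words, u, v)
    return terms[i - left : i + right + 1]
```

With `u = v = 100` and `l = 3`, the pattern around 'conversion' in 'components for frequency conversion of infrared lasers' takes three terms on each side, as in the first example of the text. Lowering `u` lets stop words count for less, so the context grows until enough non-stop words are covered.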

 
For each text pattern from the new text, we create a term vector in the vector space model. The 
size of the vector is defined by the number of different stemmed and stop word filtered terms in 
the new text. For text pattern encoding, we use binary term vectors, i.e. a vector element is set 
to one if the corresponding unstemmed term is used in the text pattern and to zero if it is 
not. We also build text patterns from the problem description and create term vectors as described 
above. 
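
A binary encoding of this kind might look as follows. This is a sketch only: it assumes the terms are already stemmed and stop word filtered, and the sorted vocabulary order and the function names are choices made for this example.

```python
def build_vocabulary(new_text_terms):
    """Distinct terms of the new text in a fixed (sorted) order."""
    return sorted(set(new_text_terms))


def binary_vector(pattern_terms, vocabulary):
    """Set an element to 1 if the vocabulary term occurs in the pattern, else 0."""
    present = set(pattern_terms)
    return [1 if term in present else 0 for term in vocabulary]
```

Terms of a problem description pattern that do not occur in the new text have no vector element here, which matches the vector size being defined by the new text.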
 
To identify new and useful ideas, we create a specific idea mining measure. This idea mining 
measure is described in Sect. 4. By comparing a vector from the new text to one from the 
problem description with this measure, we compute a result value that always lies between 0 % and 
100 %. The greater the result value, the higher the probability that the vector from the new text 
represents a new and useful idea with respect to a vector from the problem description. 
 
We use this measure to compare vectors from the new text to their most similar vectors from 
the problem description, not to all vectors. This is because result values from comparing a 
vector to its most similar vectors dominate result values from comparing it to the remaining 
vectors. For example, if a vector from the new text is similar to one from the problem description, 
then the idea is not new to the user, regardless of whether result values from comparing this 
vector to further vectors from the problem description are greater than zero. Therefore, we can be 
sure that a vector represents a new and useful idea only if it gets a high result value from the 
idea mining measure with respect to one of its most similar vectors. Furthermore, computing the 
idea mining measure is time-consuming. Therefore, it is necessary to limit the number of 
comparisons with the idea mining measure when implementing an idea mining application. 
 
We choose a two-step classification. In the first step, we compare each vector from the new 
text to all vectors from the problem description by using the well-known Euclidean distance 
measure. Fortunately, computing the Euclidean distance is not time-consuming, so 
it is suited for use in an idea mining application. In detail, for each vector from the 
new text, we identify all vectors from the problem description for which the Euclidean distance 
is lowest, i.e. we identify the most similar vectors. In the second step, we 
compare each vector from the new text to its most similar vectors using the idea mining measure.  
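
The first classification step can be illustrated with the sketch below. It uses the squared Euclidean distance, which yields the same ranking as the Euclidean distance and stays in integer arithmetic for binary vectors; the function names are hypothetical and a non-empty list of problem vectors is assumed.

```python
def squared_euclidean(a, b):
    """Squared Euclidean distance between two equally long vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def most_similar(new_vector, problem_vectors):
    """Indices of all problem description vectors at minimal distance."""
    distances = [squared_euclidean(new_vector, p) for p in problem_vectors]
    best = min(distances)
    return [j for j, d in enumerate(distances) if d == best]
```

Only the vectors returned by `most_similar` would then be passed to the more expensive idea mining measure in the second step.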
 



Each vector from the new text that is compared to several similar vectors gets the highest 
result value from the idea mining measure as its result value. To identify a new and useful idea, 
we use the alpha-cut method. An alpha-cut of the idea mining measure result value is the set of all 
vectors from the new text such that the appertaining result value is greater than or equal to 
alpha ($\tilde{\alpha}$). In the idea mining application, the user can provide the value of 
$\tilde{\alpha}$. 
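
The selection by alpha-cut can be sketched as follows, assuming that the measure values of each text pattern against its most similar problem description vectors are already available as a list of numbers between 0 and 1; the function name and data layout are assumptions for this example.

```python
def alpha_cut(scores_per_pattern, alpha):
    """Indices of patterns whose highest measure value is at least alpha.

    scores_per_pattern[i] holds the measure values of pattern i against each
    of its most similar problem description vectors.
    """
    return [
        i
        for i, scores in enumerate(scores_per_pattern)
        if scores and max(scores) >= alpha
    ]
```

A lower `alpha` returns more candidate patterns, a higher one fewer.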
 

Idea Mining Measure 
With the idea mining measure, we compare a vector that represents a text pattern from the new 
text to its most similar vectors from the problem description to identify a new and useful idea 
inside the text pattern from the new text. In detail, we have to find text patterns from the new 
text where all terms representing a means (purpose) and no terms representing a purpose (means) 
occur in a text pattern from the problem description. 
 
If all terms in the text pattern from the new text are known, which means all terms also occur in a 
text pattern from the problem description, then the idea is not new to the user. Furthermore, the 
idea is not useful if all terms in the text pattern from the new text are unknown because there is 
no relation to the problem. It is shown in [22] that to find new and useful ideas the number of 
known terms (e.g. representing a means) and the number of unknown terms (e.g. representing an 
appertaining purpose) should be well balanced. 
 
Definition 3. Let $\alpha_i$ be the set of stemmed and stop word filtered terms representing a text 
pattern with number $i$ from the new text. Let $\beta_j$ be a set of stemmed and stop word filtered 
terms representing a text pattern with number $j$ from the problem description. Let $\gamma$ be the 
set of all stemmed and stop word filtered terms from the new text. Let $x = |\gamma|$ be the 
cardinality of $\gamma$. Let $\omega_i \in \{0,1\}^x$ be a term vector in vector space model 
concerning $\alpha_i$. Let $\rho_j \in \{0,1\}^x$ be a term vector in vector space model concerning 
$\beta_j$. Let $p = |\alpha_i| = \sum_{k=1}^{x} \omega_{i,k}$ be the number of all (known and 
unknown) terms in text pattern with number $i$. Let 
$q = |\alpha_i \cap \beta_j| = \sum_{k=1}^{x} \omega_{i,k} \cdot \rho_{j,k}$ be the number of known 
terms in text pattern with number $i$ concerning a text pattern with number $j$ from the problem 
description. Then, we define $m_1$ as measure for well-balanced known and unknown term 
distribution. 

$$m_1 = \begin{cases} \dfrac{2 \cdot (p - q)}{p} & \left( q \ge \dfrac{p}{2} \right) \\[2ex] \dfrac{2 \cdot q}{p} & \left( q < \dfrac{p}{2} \right) \end{cases} \qquad (5)$$
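
As a quick check of the balance measure, the following worked example evaluates $m_1$ for a hypothetical text pattern with $p = 6$ terms, using $2 \cdot (p - q) / p$ for $q \ge p/2$ and $2 \cdot q / p$ otherwise. The values of $q$ are chosen for illustration only and do not come from the evaluation data.

```latex
\begin{align*}
q = 0 &: \quad m_1 = \frac{2 \cdot 0}{6} = 0 \\
q = 2 &: \quad m_1 = \frac{2 \cdot 2}{6} = \tfrac{2}{3} \\
q = 3 &: \quad m_1 = \frac{2 \cdot (6 - 3)}{6} = 1 \\
q = 5 &: \quad m_1 = \frac{2 \cdot (6 - 5)}{6} = \tfrac{1}{3} \\
q = 6 &: \quad m_1 = \frac{2 \cdot (6 - 6)}{6} = 0
\end{align*}
```

The measure peaks when exactly half of the terms are known and falls to zero when all terms are known or all are unknown.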

 
The known terms in the text pattern from the new text should occur in the problem description 
more frequently than other terms. This is because they represent a known means or a known 
purpose that is a central part of the problem. In the problem description, terms that represent the 
problem occur more frequently than other terms. For this, we define these frequent terms by 
using a percentage $z$ as parameter and we compute $m_2$ as the number of known and frequent 
terms over the number of all known terms. 


 
Definition 4. Let $z$ be a percentage. Let $\delta$ be a set of the $z$ % most frequent stemmed and 
stop word filtered terms in the problem description. Let $\xi \in \{0,1\}^x$ be a term vector in 
vector space model concerning $\delta$. Let 
$r = |\alpha_i \cap \beta_j \cap \delta| = \sum_{k=1}^{x} \omega_{i,k} \cdot \rho_{j,k} \cdot \xi_k$ 
be the number of known terms that occur frequently in the problem description. We define $m_2$ as 
measure for the frequent occurrence of known terms in the problem description. 

$$m_2 = \frac{r}{q} \qquad (6)$$
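
One possible way to build the set of the $z$ % most frequent terms is sketched below. The text does not specify how ties or rounding are handled, so the choices here (rounding the set size up, breaking ties alphabetically, taking $z$ % of the distinct terms) are assumptions, as is the function name.

```python
from collections import Counter
import math


def most_frequent_terms(terms, z):
    """Return the z % most frequent distinct terms (z given as a percentage)."""
    counts = Counter(terms)
    size = math.ceil(len(counts) * z / 100)
    # Rank by descending frequency; ties are broken alphabetically (assumption).
    ranked = sorted(counts, key=lambda t: (-counts[t], t))
    return set(ranked[:size])
```

The same helper could be applied to the problem description and to the new text to obtain the two frequent-term sets used by the second and third sub measures.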

 
The unknown terms in the text pattern from the new text represent a new approach (an unknown 
means or purpose), which is a central part of the new idea. These terms normally occur more 
frequently than other terms in the new text because this text deals with the new idea. For this, 
we also define these frequent terms by using a percentage $z$ as parameter and we compute $m_3$ as 
the number of unknown and frequent terms over the number of all unknown terms. 


 
Definition 5. Let $\varphi$ be a set of the $z$ % most frequent stemmed and stop word filtered 
terms in the new text. Let $\tau \in \{0,1\}^x$ be a term vector in vector space model concerning 
$\varphi$. Let 
$s = |\alpha_i \cap \overline{\beta_j} \cap \varphi| = \sum_{k=1}^{x} \omega_{i,k} \cdot \tau_k - \sum_{k=1}^{x} \omega_{i,k} \cdot \rho_{j,k} \cdot \tau_k$ 
be the number of unknown terms that occur frequently in the new text. We define $m_3$ as measure 
for the frequent occurrence of unknown terms in the new text. 

$$m_3 = \frac{s}{p - q} \qquad (7)$$

 
There are often characteristic terms (higher, quicker, integrated, minimized, etc.) that occur 
together with new ideas. They point to a changing purpose or a changing means and can be an 
indicator of new ideas. 
 
Definition 6. Let $\lambda$ be a set of these characteristic terms (stemmed and stop word 
filtered). Let $\theta \in \{0,1\}^x$ be a term vector in vector space model concerning $\lambda$. 
Let $t = |\alpha_i \cap \lambda| = \sum_{k=1}^{x} \omega_{i,k} \cdot \theta_k$ be the number of 
these characteristic terms in text pattern with number $i$. We define $m_4$ as measure for changing 
means and purposes. 

$$m_4 = \begin{cases} 1 & (t > 0) \\ 0 & (t = 0) \end{cases} \qquad (8)$$

 


The idea mining measure is based on all four heuristic sub measures. 
 

Definition 7. Let $h \in \{1,..,4\}$ and let $g_h \ge 0$ be weighting factors with 
$\sum_{h=1}^{4} g_h = 1$. Let the idea mining measure $m$ be the sum of all four sub measures 
multiplied by the weighting factors $g_h$ in case of $p \ne q$. 

$$m = \begin{cases} g_1 m_1 + g_2 m_2 + g_3 m_3 + g_4 m_4 & (p \ne q) \\ 0 & (p = q) \end{cases} \qquad (9)$$
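
The four sub measures and their weighted sum can be sketched with Python sets instead of binary vectors, which is equivalent for counting shared terms. This is an illustrative reading of Definitions 3 to 7, not the application's code. The default weights are the values of 50 %, 20 %, 20 % and 10 % reported in the evaluation; returning 0 for an empty pattern and setting $m_2 = 0$ when no term is known ($q = 0$) are assumptions, since the definitions leave these cases open.

```python
def idea_mining_measure(alpha_i, beta_j, delta, phi, lam, g=(0.5, 0.2, 0.2, 0.1)):
    """Weighted sum of the sub measures m1..m4 for one pair of text patterns.

    alpha_i: terms of the pattern from the new text
    beta_j:  terms of the pattern from the problem description
    delta:   frequent terms of the problem description
    phi:     frequent terms of the new text
    lam:     characteristic terms such as 'higher' or 'integrated'
    """
    p = len(alpha_i)
    known = alpha_i & beta_j
    unknown = alpha_i - beta_j
    q = len(known)
    if p == 0 or p == q:  # Eq. (9): no unknown terms, so no new idea
        return 0.0
    m1 = 2 * (p - q) / p if q >= p / 2 else 2 * q / p  # Eq. (5)
    m2 = len(known & delta) / q if q else 0.0  # Eq. (6); q = 0 case is an assumption
    m3 = len(unknown & phi) / (p - q)  # Eq. (7)
    m4 = 1.0 if alpha_i & lam else 0.0  # Eq. (8)
    return g[0] * m1 + g[1] * m2 + g[2] * m3 + g[3] * m4
```

For a pattern with four terms of which two are known, one known term is frequent in the problem description, one unknown term is frequent in the new text and no characteristic term occurs, the sketch yields 0.5 · 1 + 0.2 · 0.5 + 0.2 · 0.5 + 0.1 · 0 = 0.7.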

 
Idea Mining and Comprehensibility Research 
The aim of idea mining is to find new and useful ideas but also to present these ideas in a 
comprehensible way to the user. To realize this, we focus on comprehensibility research.  
 
Up to the 1960s, comprehensibility was regarded as a property of the text. It was measured 
objectively by analysing text parameters such as word length, sentence length, word usability, and 
the relationship between the number of different words and the number of words. The well-known 
approach of this time was the 'Reading Ease' formula from Flesch [8]. 
 
Later research in this field focuses on cognitive effects in text production and reception. 
The results of this research are presented by two approaches: the 'Hamburger 
Verständlichkeitsmodell' [14] and the 'Groebener Modell' [10]. Both approaches describe four 
dimensions of comprehensibility: simplicity, structure-organization, brevity-shortness and 
interest-liveliness. 
 

Figure 2: We present the new text back to the user with text patterns in bold print that represent new and useful ideas. 
 
A further approach from cognition research is named text excerption. If a human expert finds 
new and useful ideas in texts he highlights all corresponding text phrases e.g. with text marking. 
This behaviour is described by Puppe et al. [19]. 
 
In the idea mining application, text excerption is used to present the extracted ideas to the user 
(Fig. 2 shows an example). In the 'Groebener Modell', marking text patterns is important for 
structure-organization, and this leads directly to comprehensibility. On this point, the 
'Groebener Modell' differs from the 'Hamburger Verständlichkeitsmodell', in 
which structure-organization is not so important for comprehensibility.  
 



As a result, the presentation of ideas in the idea mining application is based on text excerption. 
It is comprehensible according to the 'Groebener Modell' and less comprehensible according to the 
'Hamburger Verständlichkeitsmodell'. 
 
 
Results and Discussions 
 
In a study for the German Ministry of Defence (MoD), we use this approach to identify new 
technological ideas for the German defence research program. In detail, we have to identify new 
solution ideas to solve current problems in German defence-based research projects. We extract 
new ideas from 300 descriptions of research projects granted in 2006 by the National Institute of 
Standards and Technology (NIST) in the United States Small Business Innovation Research 
(SBIR) Program. We use textual information from current defence-based research projects of the 
German MoD as problem description. As a result, we extract several new ideas that are useful for 
German defence research planners and that are now used as starting points for collaboration 
projects or for new defence-based research projects. A proper selection of these ideas is a 
strategic issue and - together with the weapon selection problem [5] - it has a significant impact 
on the efficiency of future defence systems. The results are published in [6]. Here, we show some 
successful examples: 
 
A modified focal plane array technology is identified that can be used to create a detector for the 
far ultraviolet spectrum. It leads to an improvement of military reconnaissance. This idea is new 
because up to now focal plane array technology is only used in the infrared, visual and near 
ultraviolet area.  
 
Further, the approach identifies personnel ultrasonic locating equipment that was originally 
developed to make orientation possible for fire fighters in dense smoke. It also can be used to 
improve the location and navigation of soldiers in urban warfare (e.g. in buildings). 
 
Additionally, the approach shows that the use of avalanche photodiode (APD) technology can 
improve the internal gain and the dark current of infrared detectors. This also leads to an 
improvement of military reconnaissance. 
 
This study shows that some of the automatically extracted ideas are useful for technological 
research planners from the German MoD. Unfortunately, the problem description used (textual 
information about current defence-based research projects) is classified as German restricted 
(Verschlusssache - Nur für den Dienstgebrauch), which means it may not be distributed to the 
scientific community. Therefore, we cannot use the results of this study to evaluate this idea 
mining approach. However, a separate evaluation (see Sect. 6) is done using (unclassified) patent 
data, which allows the evaluation to be re-computed. 
 



Evaluation 
The idea mining measure as central point in the idea mining approach consists of four heuristic 
sub-measures that are not theoretically founded. Therefore, it is crucial to provide an extensive 
evaluation to show their success. We compare this approach to a baseline because we are not 
aware of other approaches for idea mining. As measure for the baseline, we use Jaccard's 
coefficient [7] as well-known heuristic similarity measure.  
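
For reference, Jaccard's coefficient of two term sets is the size of their intersection divided by the size of their union. A minimal sketch follows; how exactly the baseline application applies the coefficient to text patterns is not detailed here, so the function below only shows the coefficient itself, and returning 0 for two empty sets is an assumption.

```python
def jaccard(a, b):
    """Jaccard's coefficient |a ∩ b| / |a ∪ b| of two term collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

Unlike the idea mining measure, this coefficient grows monotonically with the share of known terms, so it rewards similarity rather than a balance of known and unknown terms.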
 
The idea mining approach is evaluated by using our idea mining application (see Sect. 8). There 
the web based application and all texts that are used for evaluation are presented. Additionally, 
we create an alternative idea mining application, based on Jaccard's coefficient instead of the idea 
mining measure for the sole purpose of comparison to the baseline. 
 
For evaluation, we use patent data because patent descriptions normally contain new ideas, 
which include a considerable part of scientific and technological knowledge [13]. We use the 
abstract of a patent as new text. A patent is often based on further patents. We aggregate the 
abstracts of these references as problem description. Then we identify new and useful ideas from 
this patent with respect to its patent references using the idea mining applications. 
 
We use abstracts from 40 randomly selected patents and from their references, a general stop 
word list and the Porter stemmer for evaluation. Then we determine the parameters of the idea 
mining measure ($g_1$, $g_2$, $g_3$, $g_4$, $\tilde{\alpha}$, and $z$) as well as the parameters 
for the length of the text patterns ($l$, $u$, and $v$).  
 
For this, we use further patent data and their references as new text and as problem description. 
The results are evaluated by a human expert and compared to each single sub measure $m_1$, $m_2$, 
$m_3$ and $m_4$ alone. We find that using the first sub measure alone is successful. If this sub 
measure is small, then the corresponding text pattern normally does not contain a new and useful 
idea. If this sub measure is large, then the probability that the text pattern contains a new idea 
is also high. We also find that using the further sub measures alone is not successful. This 
means they are successful only if the result value of the first sub measure is medium to high. 
Therefore, they can only be used in addition to the first sub measure.  


 
The results of the second and third sub measures depend on the parameter $z$. This parameter is 
used to define frequent terms by building a set of the $z$ % most frequent stemmed and stop word 
filtered terms. We heuristically assume that this parameter should be between 10 % and 30 % to get 
good sub measures. This is because if $z$ is greater than 30 %, then we probably classify several 
terms that only occur once as frequent terms. If $z$ is smaller than 10 %, then we only identify 
highly frequent terms for the set. In this case, the result values of the second and third sub 
measures are small regardless of whether known terms occur frequently in the problem description 
or unknown terms occur frequently in the new text. Therefore, we set $z$ to the mean value 
(20 %). Additionally, we see that the second and third sub measures are nearly equally successful 
and that the fourth sub measure is less successful. Therefore, we heuristically set the 
parameters $g_1$ to 50 %, $g_2$ to 20 %, $g_3$ to 20 % and $g_4$ to 10 %.  
 



We also have used other values to optimize the combination of these four sub measures. However, 
we do not find a combination that is generally superior to the selected combination. This is 
because the success of these value combinations depends on the quality of the user given textual 
information.  
 
Then, we determine the alpha cut value $\tilde{\alpha}$ of the idea mining measure $m$. If the
percentage $\tilde{\alpha}$ is small, we get many result items. This leads to a small precision
value because many extracted text patterns do not contain a new and useful idea. If
$\tilde{\alpha}$ is large, we get only a very small number of results, and the recall value is
probably small because we miss most of the new and useful ideas in the new text. A human expert
checks the results for several patent descriptions to find an optimal value of $\tilde{\alpha}$.
In his experience, 60 % is a good compromise. Therefore, we set $\tilde{\alpha}$ to 60 %. Using
the same evaluation procedure, we set the alpha cut value of Jaccard's coefficient, the baseline
measure, to 20 %.
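
The effect of the alpha cut can be illustrated with a short Python sketch. It assumes that the
measure values are scaled to the range from 0 to 1 and that the alpha cut acts as a fixed
threshold on these values; the function name and the example scores are hypothetical.

```python
def alpha_cut(scored_patterns, alpha=0.60):
    """Keep the text patterns whose measure value reaches the alpha cut.

    `scored_patterns` is a list of (pattern, value) pairs; values are
    assumed to lie between 0 and 1.
    """
    return [pattern for pattern, value in scored_patterns if value >= alpha]


if __name__ == "__main__":
    # Hypothetical measure values for three text patterns.
    scored = [("pattern A", 0.72), ("pattern B", 0.55), ("pattern C", 0.31)]
    print(alpha_cut(scored))        # stricter cut, fewer results
    print(alpha_cut(scored, 0.30))  # looser cut, more results
```

A lower cut returns more patterns and tends to reduce precision, while a higher cut returns fewer
patterns and tends to reduce recall, which is the trade-off described above.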

 
After this, we determine the length of the text patterns. The length depends on the parameter $l$
and on $f_g(w_i)$, a term weighting scheme that is based on the difference between stop words and
non-stop words (see Sect. 3). Text patterns should not be too short, so that they contain all terms
representing a new and useful idea. They should also not be too long, because otherwise they
include further terms that are not related to the new and useful idea. To find an optimal size, we
create text patterns from several patent descriptions using different values for $l$ and for the
percentages $u$ and $v$. A human expert checks the resulting text patterns of different lengths
for an optimal size. He gets the best results with a text pattern length $l$ of 7 terms, the
percentage $u$ set to 50 % and $v$ set to 100 %.
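
One possible reading of this weighting is sketched below in Python. It assumes that a stop word
contributes the weight u and a non-stop word the weight v to the pattern length, and that a
pattern is closed as soon as the summed weight reaches l. Both this interpretation and all names
in the code are assumptions for illustration; the exact construction is the one defined in Sect. 3.

```python
def build_pattern(tokens, start, stop_words, l=7.0, u=0.5, v=1.0):
    """Collect tokens from `start` until their summed weight reaches l.

    Assumed weighting: stop words count u, all other terms count v.
    """
    pattern, weight = [], 0.0
    for token in tokens[start:]:
        if weight >= l:
            break
        pattern.append(token)
        weight += u if token.lower() in stop_words else v
    return pattern


if __name__ == "__main__":
    # Hypothetical sentence and stop word list.
    text = ("the sensor of the device measures heat in the air "
            "and sends data to a base station")
    stop_words = {"the", "of", "in", "and", "to", "a"}
    pattern = build_pattern(text.split(), 0, stop_words)
    print(len(pattern), pattern)
```

Under these assumptions, a pattern of weighted length 7 spans more than seven tokens whenever it
contains stop words, so that stop words do not crowd out the content-bearing terms.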
 
Then, the approach automatically extracts about 200 new ideas from the 40 randomly selected
patents. To cluster these results, means and purposes are assigned to scientific categories of the
Science Citation Index; examples are presented below. Several ideas are identified that use
methods from 'Artificial Intelligence' (mean) for applications in 'Health Care Sciences and 
Services' (purpose). We also identify new ideas using 'Imaging Science and Photographic 
Technology' (mean) for 'Medical Informatics' purposes. Further ideas use techniques from 
'Remote Sensing' (mean) in the field of 'Tropical Medicine' (purpose). Additionally, several ideas 
use 'Computer Science, Theory and Methods' (mean) for applications in 'Psychiatry' (purpose). 
Furthermore, methods from 'Artificial Intelligence' (mean) are used for 'Automation and Control 
Systems' purposes. 
 
To evaluate these results, we use the precision and recall measures commonly used in information
retrieval, which are based on true positives, false positives and false negatives. For this, we
have to define the ground truth for our evaluation. Therefore, a human expert also identifies new
and useful ideas in these patents manually, that is, without using our idea mining approach. He
uses the idea definition in Sect. 1.2: he checks each text pattern for terms representing a known
mean (purpose) together with terms representing an unknown purpose (mean). These results are the
ground truth for the evaluation.
 
For each patent, we compute precision and recall values using the idea mining measure and using
Jaccard's coefficient. Then, we compute the average precision and recall values. As a result, we
get a precision value of 40 % and a recall value of 25 % with the idea mining measure. A precision
value of 40 % means that if the idea mining approach extracts ten text patterns, four of them
represent a new and useful idea. A recall value of 25 % means that if there are four new and
useful ideas in the new text, the idea mining approach extracts only one of them. In contrast, we
get a precision value of 30 % and a recall value of 20 % with Jaccard's coefficient. This is
because in some texts Jaccard's coefficient extracts text patterns from the new text that are
similar to text patterns from the problem description. Such a pattern probably represents a known
idea rather than a new idea.
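
The computation of these values can be sketched in Python as follows. The text pattern
identifiers and the function names are hypothetical; the example is constructed so that it
reproduces the reported 40 % precision and 25 % recall for a single patent.

```python
def precision_recall(extracted, ground_truth):
    """Return (precision, recall) for one patent.

    Both arguments are collections of text pattern identifiers; the
    identifiers themselves are hypothetical.
    """
    extracted, ground_truth = set(extracted), set(ground_truth)
    true_pos = len(extracted & ground_truth)
    false_pos = len(extracted - ground_truth)
    false_neg = len(ground_truth - extracted)
    precision = true_pos / (true_pos + false_pos) if extracted else 0.0
    recall = true_pos / (true_pos + false_neg) if ground_truth else 0.0
    return precision, recall


def average(values):
    """Arithmetic mean, used to average the per-patent values."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0


if __name__ == "__main__":
    extracted = range(10)                                # ten extracted patterns
    ground_truth = [0, 1, 2, 3] + list(range(100, 112))  # sixteen true ideas
    print(precision_recall(extracted, ground_truth))     # (0.4, 0.25)
```

Four of the ten extracted patterns are in the ground truth (precision 4/10), and four of the
sixteen ground truth ideas are found (recall 4/16).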
 
Besides Jaccard's coefficient, we also test other well-known heuristic measures such as the overlap
index, cosine similarity and Dice similarity [7] as baselines. However, we get nearly the same
results for the precision (30 %) and the recall (20 %) values.
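
For reference, the four baseline measures can be written compactly in their set-based (binary)
form, as in the Python sketch below. Treating each text pattern as a set of terms is an assumption
of this sketch; weighted term vectors would call for the vector forms of the same measures. The
example terms are hypothetical.

```python
import math


def jaccard(a, b):
    """|A and B| / |A or B|"""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def overlap(a, b):
    """|A and B| / min(|A|, |B|)"""
    a, b = set(a), set(b)
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0


def dice(a, b):
    """2 * |A and B| / (|A| + |B|)"""
    a, b = set(a), set(b)
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


def cosine(a, b):
    """|A and B| / sqrt(|A| * |B|), the cosine of two binary term vectors."""
    a, b = set(a), set(b)
    norm = math.sqrt(len(a) * len(b))
    return len(a & b) / norm if norm else 0.0


if __name__ == "__main__":
    problem = {"laser", "sensor", "heat", "signal"}
    new_text = {"heat", "signal", "radar", "antenna"}
    for measure in (jaccard, overlap, dice, cosine):
        print(measure.__name__, round(measure(problem, new_text), 3))
```

All four measures grow with the number of shared terms, so a high value signals similarity to the
problem description rather than novelty, which is consistent with their nearly identical results
as baselines.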
 
 
The Idea Mining Application 
 
The idea mining application is aimed at users without extensive knowledge of the text mining field
as well as at text mining experts. It gives them the possibility to extract problem solution ideas
for their own needs using this idea mining approach. They can access the web-based application via
the internet. It is available at http://www.text-mining.info and is programmed in Perl and Ruby.
 
A user has to provide two text files: a problem description and a new text that probably contains
problem solution ideas. These files can be formatted in various ways, e.g. as plain text, HTML or
XML. However, scripting code, (HTML or XML) tags and images are discarded, that is, the application
extracts plain text from the provided files. Then, the user has to select the language of these
texts so that a general stop word list for this language can be integrated. The application offers
general stop word lists in English, German, Dutch, Spanish and French. After the parameters of the
application have been set, the automatic extraction of new and useful ideas from the new text
starts as described in the idea mining process (see Sect. 3 and Sect. 4). As a result, new ideas
are presented as described in Sect. 5.
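
The plain text extraction step can be illustrated with a small Python sketch based on the
standard library module html.parser. It is not the code of the application, which is written in
Perl and Ruby; the class and function names are hypothetical, and the sketch only handles HTML-like
markup.

```python
from html.parser import HTMLParser


class PlainTextExtractor(HTMLParser):
    """Collect text content, dropping tags and script/style bodies."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Text inside <script> or <style> is discarded.
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Join the chunks and normalize all whitespace to single spaces.
        return " ".join(" ".join(self._chunks).split())


def extract_plain_text(markup):
    parser = PlainTextExtractor()
    parser.feed(markup)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    page = ("<html><head><script>var x = 1;</script></head>"
            "<body><p>A <b>new</b> idea</p><img src='a.png'></body></html>")
    print(extract_plain_text(page))
```

Images carry no text content and therefore vanish together with their tags, so only the running
text reaches the subsequent tokenization and stop word filtering.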
 
 
Conclusions and Future Research 
 
 
This study shows the success of an automatic approach for finding new ideas in textual
information. For this, the study transfers creativity approaches from psychology and cognitive
science to text mining. One main finding is that an abstract term (an idea) can be redefined
concretely enough to be processed with text mining methods. In detail, it is shown that a
technological idea represents a combination of a purpose and a mean and that purposes and means
are defined by combinations of co-occurring terms.
 



Additionally, it is shown that problems and problem solution ideas can be represented as term
vectors in the vector space model. For this, the study contributes a new (idea mining) measure.
This measure identifies new ideas by comparing vectors that represent a problem to vectors that
represent a problem solution idea. Last, it is shown that approaches from comprehensibility
research can be adapted to this approach to present the new ideas to the user in a comprehensible
way. As a further main finding, it is demonstrated that this theoretical approach can be realized
as a web-based application. The success of the idea mining measure is shown by comparing it to
further heuristic measures (Jaccard's coefficient, overlap index, cosine similarity and Dice
similarity).
 
Directions for future research follow from the fact that a large amount of textual information is
now available on the internet and that this information probably contains many new technological
ideas. Extending this approach to a web idea mining approach that automatically identifies problem
solution ideas on the internet is an interesting topic for further research.
 
Additionally, the parameters of the approach can be optimized, and the idea mining measure can
probably be extended with further aspects to improve its quality, that is, to get better precision
and recall values.
 
A further aspect is to transfer this idea mining approach to colloquial language. For this, the
idea definition also has to cover new product ideas from consumers. Then, new product ideas can be
identified to support marketing activities.
 
Last, the approach can be extended with innovation-related aspects. Then, extracted ideas can be
classified as innovative ideas and might be used as a starting point for new product development.
 
 
Acknowledgements
We thank Joachim Schulze and Jörg Fenner for constructive technical comments. 
 
References 
 
[1] Albers, S., & Gassmann, O. (2005). Handbuch Technologie- und Innovationsmanagement: Strategie -
Umsetzung - Controlling (p. 196). Wiesbaden: Gabler Verlag.
[2] Baeza-Yates, R., & Ribeiro-Neto, B. (1999). Modern Information Retrieval. New York: ACM Press. 
[3] Coussement, K., & Van den Poel, D. (2008). Integrating the voice of customers through call center emails into a 
decision support system for churn prediction. Information & Management 45, 165. 
[4] Coussement, K., & Van den Poel, D. (2009). Improving customer attrition prediction by integrating emotions 
from client/company interaction emails and evaluating multiple classifiers. Expert Systems with Applications 36, 
6127-6134. 
[5] Dagdeviren, M., Yavuz, S., & Kilinc, N. (2009). Weapon selection using the AHP and TOPSIS methods under 
fuzzy environment. Expert Systems with Applications 36, 8150. 
[6] Fenner, J., & Thorleuchter, D. (2009). Textmining-Analyse von Forschungsvorhaben des National Institute of 
Standards and Technology. Euskirchen: Fraunhofer INT Edition. 
[7] Ferber, R. (2003). Information Retrieval (pp. 74-80). Heidelberg: dpunkt.verlag.
[8] Flesch, R. (1948). A new readability yardstick. Journal of Applied Psychology 32, 221-233. 
[9] Gentsch, P., & Hänlein, M. (1999). Text Mining. WISU 12, 1646. 
[10] Groeben, N. (1982). Leserpsychologie: Textverständnis - Textverständlichkeit. Münster: Aschendorff.
[11] Hoffmann, L., Kalverkämper, H., & Wiegand, H.E. (1998). Fachsprachen - Languages for Special purposes: 
Ein internationales Handbuch zur Fachsprachenforschung und Terminologiewissenschaft - an international 
Handbook of Special-language and Terminology Research (p. 1602). Berlin: Walter de Gruyter. 
[12] Hotho, A., Nürnberger, A., & Paaß, G. (2005). A Brief Survey of Text Mining. LDV Forum 20(1), 19-26.
[13] Li, Y.R., Wang, L.H., & Hong, C.F. (2009). Extracting the significant-rare keywords for patent analysis. Expert 
Systems with Applications 36, 5200-5204. 
[14] Langer, I., Schulz v. Thun, F., & Tausch, R. (1974). Verständlichkeit in Schule und Verwaltung. München: 
Ernst Reinhardt. 
[15] Lustig, G. (1986). Automatische Indexierung zwischen Forschung und Anwendung (p. 92). Hildesheim: Georg 
Olms Verlag. 
[16] Martín-Bautista, M.J., Sánchez, D., Serrano, J.M., & Vila, M.A. (2004). Text Mining using Fuzzy Association
Rules. In V. Loia, M. Nikravesh, & L.A. Zadeh (Eds.), Fuzzy Logic and the Internet (p. 173). Berlin: Springer-Verlag.
[17] Osborn, A.F. (1948). Your Creative Power. New York: C. Scribner's Sons.
[18] Porter, M.F. (1980). An algorithm for suffix stripping. Program 14(3), 130-137. 
[19] Puppe, F., Stoyan, H., & Studer, R. (2003). Knowledge Engineering. In G. Görz, C.R. Rollinger, & J.
Schneeberger (Eds.), Handbuch der Künstlichen Intelligenz (p. 611). München: Oldenbourg. 
[20] Ripke, M., & Stöber, G. (1972). Probleme und Methoden der Identifizierung potentieller Objekte der 
Forschungsförderung. In H. Paschen & H. Krauch (Eds.), Methoden und Probleme der Forschungs- und 
Entwicklungsplanung (p. 47). München: Oldenbourg.
[21] Ropohl, G. (1996). Das Ende der Natur. In L. Schäfer & E. Ströker (Eds.), Naturauffassungen in Philosophie,
Wissenschaft und Technik (pp. 143-163). Freiburg, München: Alber. 
[22] Thorleuchter, D. (2008). Finding Technological Ideas and Inventions with Text Mining and Technique 
Philosophy. In C. Preisach, H. Burkhardt, L. Schmidt-Thieme, & R. Decker (Eds.), Data Analysis, Machine Learning,
and Applications (pp. 413-420). Berlin: Springer-Verlag. 
[23] Wallas, G. (1926). The Art of Thought. New York: Harcourt Brace. 
 
 