Linked science presentation 25

  • Published on
    01-Jul-2015

DESCRIPTION

Clustering Citation Distributions for Semantic Categorization and Citation Prediction, by F. Osborne, S. Peroni, E. Motta. In this paper we present i) an approach for clustering authors according to their citation distributions and ii) an ontology, the Bibliometric Data Ontology, for supporting the formal representation of such clusters. This method allows the formulation of queries that take into consideration the citation behaviour of an author, and it predicts future citation behaviours with a good level of accuracy. We evaluate our approach with respect to alternative solutions and discuss the predictive abilities of the identified clusters. URL: http://oro.open.ac.uk/40784/1/lisc2014.pdf

Transcript

  • 1. Clustering Citation Distributions for Semantic Categorization and Citation Prediction. Francesco Osborne (a), Silvio Peroni (b, c), Enrico Motta (a). (a) KMi, The Open University, United Kingdom; (b) Department of Computer Science and Engineering, University of Bologna, Bologna, Italy; (c) Institute of Cognitive Sciences and Technologies, CNR, Rome, Italy. October 2014

  • 2. Is it possible to say who will have a bigger impact?
  • 3. Can I exploit this information for semantic expert search?
  • 4. Our approach. Diagram labels: Authors data; Clustering of Citation Distribution; Clusters of authors with similar citation patterns; Extraction of semantic features; RDF; BiDO Ontology.
  • 5. Clustering Citation Distributions. We cluster the citation distributions by exploiting a bottom-up hierarchical clustering algorithm. We thus need to define:
      - a norm;
      - a metric to assess the quality of a set of clusters.
  • 6. Clustering Citation Distributions. [Figure: four citation distributions labelled A, B, C, D]
  • 7. Clustering Citation Distributions. [Figure: A, B, C, D] (1) dis(A, B) = dis(C, D)
  • 8. Clustering Citation Distributions. [Figure: A, B, C, D] (1) dis(A, B) = dis(C, D); (2) dis(A, C) > 0, dis(B, D) > 0
  • 9. Clustering Citation Distributions. [Figure: A, B, C, D] (1) dis(A, B) = dis(C, D); (2) dis(A, C) > 0, dis(B, D) > 0; (3) can be computed incrementally
  • 10. Clustering Citation Distributions. A simple way to satisfy these three requirements is to use a normalized Euclidean distance:
  • 11. [Formula not preserved in the transcript; only the fragment "/2" survives.]
  • 12. Clustering Citation Distributions. We want to maximize the homogeneity of the cluster populations in the following years.
  • 13. Standard deviation is not the solution
  • 14. Standard deviation is not the solution
  • 15. Clustering Citation Distributions. We estimate the homogeneity by computing the weighted average of the MAD. MAD (Median Absolute Deviation) is a robust measure of statistical dispersion, used to compute the variability of a univariate sample of quantitative data.
  • 16. Clustering Citation Distributions. We then compute the memberships of all authors in our dataset with the centroids of the resulting clusters. [Formula not preserved in the transcript.]
  • 17. Finally, we calculate a number of statistics for estimating the evolution of the members of each cluster.
  • 18. How can we represent this data? Bibliometric data are subject to the simultaneous application of different variables. In particular, one should take into account at least:
      - the temporal association of such data to entities;
      - the particular agent who provided such data (e.g., Google Scholar, Scopus, our algorithm);
      - the characterisation of such data in at least two different kinds, i.e., numeric bibliometric data (e.g., the standard bibliometric measures such as h-index, journal impact factor, citation count) and categorical bibliometric data (so as to enable the description of entities, e.g., authors, according to specific descriptive categories).
  • 19. BiDO
  • 20. Extraction of Semantic features
  • 21. :hasCurve [ a :Curve ; :hasTrend :increasing ;
  • 22. :hasCurve [ a :Curve ; :hasTrend :increasing ; :hasAccelerationPoint :premature-deceleration ] ;
  • 23. :hasCurve [ a :Curve ; :hasTrend :increasing ; :hasAccelerationPoint :premature-deceleration ] ; :hasSlope [ a :Slope ; :hasStrength :low ;
  • 24. :hasCurve [ a :Curve ; :hasTrend :increasing ; :hasAccelerationPoint :premature-deceleration ] ; :hasSlope [ a :Slope ; :hasStrength :low ; :hasGrowth :logarithmic ] ;
  • 25. :hasCurve [ a :Curve ; :hasTrend :increasing ; :hasAccelerationPoint :premature-deceleration ] ; :hasSlope [ a :Slope ; :hasStrength :low ; :hasGrowth :logarithmic ] ; :hasOrderOfMagnitude :[243,729) ;
  • 26. :hasCurve [ a :Curve ; :hasTrend :increasing ; :hasAccelerationPoint :premature-deceleration ] ; :hasSlope [ a :Slope ; :hasStrength :low ; :hasGrowth :logarithmic ] ; :hasOrderOfMagnitude :[243,729) ; :concernsResearchPeriod :5-years-beginning .
  • 27.
      :increasing-with-premature-deceleration-and-low-logarithmic-slope-in-[243,729)-5-years-beginning a :ResearchCareerCategory ;
          :hasCurve [ a :Curve ; :hasTrend :increasing ; :hasAccelerationPoint :premature-deceleration ] ;
          :hasSlope [ a :Slope ; :hasStrength :low ; :hasGrowth :logarithmic ] ;
          :hasOrderOfMagnitude :[243,729) ;
          :concernsResearchPeriod :5-years-beginning .
  • 28.
      :john-doe :holdsBibliometricDataInTime [ a :BibliometricDataInTime ;
          tvc:atTime [ a time:Interval ; time:hasBeginning :2014-07-11 ] ;
          :accordingTo [ a fabio:Algorithm ;
              frbr:realization [ a fabio:ComputerProgram ] ] ;
          :withBibliometricData :increasing-with-premature-deceleration-and-low-logarithmic-slope-in-[243,729)-5-years-beginning ] .

      :increasing-with-premature-deceleration-and-low-logarithmic-slope-in-[243,729)-5-years-beginning a :ResearchCareerCategory ;
          :hasCurve [ a :Curve ; :hasTrend :increasing ; :hasAccelerationPoint :premature-deceleration ] ;
          :hasSlope [ a :Slope ; :hasStrength :low ; :hasGrowth :logarithmic ] ;
          :hasOrderOfMagnitude :[243,729) ;
          :concernsResearchPeriod :5-years-beginning .
  • 29. Evaluation. We evaluated our method on a dataset of 20,000 researchers working in the field of computer science in the 1990-2010 interval. This dataset was derived from the database of Rexplore, a system that supports the exploration of scholarly data and integrates several data sources (Microsoft Academic Search, DBLP++ and DBpedia).
  • 30. Evaluation
  • 31. Evaluation

      Y   C18 (1.4%)         C22 (2.5%)         C25 (2.7%)         C28 (2.3%)         C29 (8.8%)
          range     mean     range     mean     range     mean     range     mean     range     mean
      6   420-800   567±98   160-280   209±34   100-180   129±25   60-100    72±14    40-60     39±9
      7   440-960   610±120  160-320   225±45   100-200   138±30   60-120    79±18    40-80     45±14
      8   440-1020  650±137  160-400   246±58   100-260   158±45   60-160    90±26    40-100    50±18
      9   440-1260  699±186  160-440   269±74   100-340   187±68   60-200    104±37   40-120    57±25
      10  480-2940  751±411  160-500   292±85   100-400   211±82   60-280    125±57   40-160    68±35
      11  480-2480  826±336  180-660   331±112  100-520   241±100  60-540    155±103  40-200    82±47
      12  480-3520  914±467  180-860   370±151  100-640   270±126  60-440    166±96   40-260    97±60

  • 32. Evaluation
  • 33. Future Works
      - Augment the clustering process with a variety of other features (e.g., research areas, co-authors);
      - apply this technique to groups of researchers rather than single individuals;
      - extend BiDO in order to provide a semantically-aware description of such new features;
      - make available a triplestore of bibliometric data linked to other datasets such as Semantic Web Dog Food and DBLP.
  • 34. Questions? francesco.osborne@open.ac.uk, silvio.peroni@unibo.it, e.motta@open.ac.uk. BiDO Ontology: http://purl.org/spar/bido
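
The distance and homogeneity measures from slides 7 to 15 can be illustrated with a short Python sketch. It is not the authors' implementation. The slide formula for the normalized Euclidean distance did not survive in this transcript, so `normalized_distance` below is a hypothetical normalization (Euclidean distance divided by the mean of the two series totals). It was chosen only because it satisfies requirement (1), under the reading that C and D are scaled-up copies of A and B, and requirement (2). Weighting each cluster's MAD by its size in `weighted_mad` is likewise an assumption; the slides say only "weighted average". Incremental computation, requirement (3), is not shown.

```python
import math
from statistics import median


def normalized_distance(a, b):
    """Euclidean distance divided by the mean total of the two series.

    ASSUMPTION: hypothetical normalization; the formula on the original
    slide is not preserved in the transcript.
    """
    if len(a) != len(b):
        raise ValueError("distributions must cover the same number of years")
    euclid = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    scale = (sum(a) + sum(b)) / 2
    return euclid / scale if scale else 0.0


def mad(values):
    """Median Absolute Deviation: median of |x - median(x)|."""
    m = median(values)
    return median([abs(v - m) for v in values])


def weighted_mad(clusters):
    """Average of per-cluster MADs, weighted by cluster size (assumed weights)."""
    total = sum(len(c) for c in clusters)
    return sum(len(c) * mad(c) for c in clusters) / total


# Illustrative yearly citation counts (made up for this sketch).
A = [1, 2, 3]
B = [2, 3, 4]
C = [10 * x for x in A]  # same shape as A, ten times the volume
D = [10 * x for x in B]

same_shape = math.isclose(normalized_distance(A, B), normalized_distance(C, D))
different_scale = normalized_distance(A, C) > 0

# One extreme value barely moves the MAD, unlike the standard deviation.
robust_spread = mad([10, 12, 11, 13, 200])
```

Here `same_shape` and `different_scale` correspond to requirements (1) and (2), and `robust_spread` shows why a median-based dispersion measure is less sensitive to a single highly cited outlier than the standard deviation.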