Project

Bibliometrics Engine

The name of the project is Bibliometrics Engine

It is a piece of software that takes as input a file in a particular format (CSV) containing citations between scholarly documents, each identified by a Digital Object Identifier (DOI)

The goal of the software is to run particular analyses and extractions on such data

You need a group

The project must be implemented by a group of people

You need to

  • form groups of at least 3 and at most 4 people

  • choose a name for the group (yes: a name) - it will be used to publish the ranks of the best-performing projects (more info later)

  • communicate the name of the group and its members (including their emails) to me by sending an email at silvio.peroni@unibo.it

Final deadline: all groups must be ready by next Wednesday at the latest (16 December)

Project stub (1/2)

from <group_name> import *

class BibliometricEngine(object):
    def __init__(self, citations_file_path):
        self.data = process_citations(citations_file_path)

    def compute_impact_factor(self, dois, year):
        return do_compute_impact_factor(self.data, dois, year)

    def get_co_citations(self, doi1, doi2):
        return do_get_co_citations(self.data, doi1, doi2)

    def get_bibliographic_coupling(self, doi1, doi2):
        return do_get_bibliographic_coupling(self.data, doi1, doi2)

    def get_citation_network(self, start, end):
        return do_get_citation_network(self.data, start, end)
	
    def merge_graphs(self, g1, g2):
        return do_merge_graphs(self.data, g1, g2)

Project stub (2/2)

    def search_by_prefix(self, prefix, is_citing, dump):
        if dump is None:
            return do_search_by_prefix(self.data, prefix, is_citing)
        else:
            return do_search_by_prefix(dump, prefix, is_citing)

    def search(self, query, field, dump):
        if dump is None:
            return do_search(self.data, query, field)
        else:
            return do_search(dump, query, field)
	
    def filter_by_value(self, query, field, dump):
        if dump is None:
            return do_filter_by_value(self.data, query, field)
        else:
            return do_filter_by_value(dump, query, field)

Your output

You have to develop a Python file named after your group, with spaces replaced by underscores and the whole file name lowercase

E.g.: group Best group ever, file best_group_ever.py

The import statement must specify the module implemented

E.g.: from best_group_ever import *

Each group has to implement the functions highlighted in red in the previous slides, i.e. process_citations and all the do_* functions

process_citations

def process_citations(citations_file_path)

It takes as input a comma-separated values (CSV) file and returns a data structure containing all the data included in the CSV in some form

The data can be preprocessed, changed according to some empirical rule, ordered in a certain way, etc.

These data will be automatically provided as input to the other functions
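
A minimal sketch of this function, assuming that a plain list of dictionaries (one per row of the CSV) is the chosen data structure; any representation that the other functions can consume is equally valid:

import csv

def process_citations(citations_file_path):
    # Load the CSV into a list of dictionaries, one per citation,
    # keyed by the CSV column names (citing, cited, creation, timespan)
    with open(citations_file_path, encoding="utf-8") as f:
        return list(csv.DictReader(f))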

Example of CSV citation data

citing,cited,creation,timespan
10.2964/jsik_2020_003,10.1007/s11192-019-03217-6,2020-02-29,P0Y5M15D
10.1007/s11192-019-03311-9,10.1007/s11192-019-03217-6,2019-12-04,P0Y2M20D
10.1007/s11192-019-03217-6,10.1007/978-3-030-00668-6_8,2019-09-14,P1Y
10.1007/s11192-019-03217-6,10.1007/978-3-319-11955-7_42,2019-09-14,P5Y
...

citing (string): the DOI of a citing article

cited (string): the DOI of a cited article

creation (string): publication date of the citing article (date format: YYYY-MM-DD, YYYY-MM or YYYY)

timespan (string): the difference between the publication date of the citing article and the publication date of the cited article (duration format: PnYnMnD, PnYnM, or PnY, with a leading - before P for negative durations)

do_compute_impact_factor

def do_compute_impact_factor(data, dois, year)

data: the data returned by process_citations

dois: a set of DOIs identifying articles

year: a string in format YYYY to consider

It returns a number which is the result of the computation of the Impact Factor (IF) for such documents. The IF of a set of documents dois on a year year is computed by counting the number of citations all the documents in dois have received in year year, and then dividing such a value by the number of documents in dois published in the previous two years (i.e. in year-1 and year-2).
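
As a rough sketch, assuming the list-of-dictionaries structure used in the process_citations example above, and approximating the publication year of an article in dois with the creation date of the rows in which it appears as citing article (the CSV carries no explicit publication date for cited-only articles):

def do_compute_impact_factor(data, dois, year):
    # Citations received in `year` by any document in `dois`
    received = sum(1 for row in data
                   if row["cited"] in dois and row["creation"][:4] == year)

    # Documents in `dois` published in year-1 or year-2, with the publication
    # year taken from the `creation` date of their citing rows (assumption)
    previous_years = {str(int(year) - 1), str(int(year) - 2)}
    published = {row["citing"] for row in data
                 if row["citing"] in dois
                 and row["creation"][:4] in previous_years}

    # The specification does not say what to do when the denominator is zero;
    # returning 0 here is just one possible choice
    return received / len(published) if published else 0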

do_get_co_citations

def do_get_co_citations(data, doi1, doi2)

data: the data returned by process_citations

doi1: the DOI string of the first article

doi2: the DOI string of the second article

It returns an integer defining how many times the two input documents are cited together by other documents.
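
A possible sketch, again assuming the list-of-dictionaries structure:

def do_get_co_citations(data, doi1, doi2):
    # Documents citing doi1 and documents citing doi2
    citing_doi1 = {row["citing"] for row in data if row["cited"] == doi1}
    citing_doi2 = {row["citing"] for row in data if row["cited"] == doi2}
    # A co-citation comes from a document that cites both
    return len(citing_doi1 & citing_doi2)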

do_get_bibliographic_coupling

def do_get_bibliographic_coupling(data, doi1, doi2)

data: the data returned by process_citations

doi1: the DOI string of the first article

doi2: the DOI string of the second article

It returns an integer defining how many times the two input documents both cite the same document.
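
A symmetric sketch with respect to the co-citation case, under the same assumptions on the data structure:

def do_get_bibliographic_coupling(data, doi1, doi2):
    # References of doi1 and references of doi2
    cited_by_doi1 = {row["cited"] for row in data if row["citing"] == doi1}
    cited_by_doi2 = {row["cited"] for row in data if row["citing"] == doi2}
    # Bibliographic coupling counts the references the two documents share
    return len(cited_by_doi1 & cited_by_doi2)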

do_get_citation_network

def do_get_citation_network(data, start, end)

data: the data returned by process_citations

start: a string defining the starting year to consider (format: YYYY)

end: a string defining the ending year to consider (format: YYYY) - it must be equal to or greater than start

It returns a directed graph containing all the articles involved in citations where both the citing and the cited article have been published within the input start-end interval (start and end included). Use the DOIs of the articles involved in the citations as the names of the nodes.
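
A sketch assuming networkx is used for the graph (the mention of DiGraphs in the next slide suggests it) and that the publication year of an article is taken from the rows in which it appears as citing article; a complete solution would also derive the year of cited-only articles, e.g. from creation minus timespan:

from networkx import DiGraph

def do_get_citation_network(data, start, end):
    # Publication year of each article that appears as citing article
    pub_year = {row["citing"]: row["creation"][:4] for row in data}

    g = DiGraph()
    for row in data:
        citing, cited = row["citing"], row["cited"]
        citing_year = pub_year.get(citing)
        cited_year = pub_year.get(cited)
        # YYYY strings compare correctly in lexicographic order
        if citing_year is not None and cited_year is not None \
                and start <= citing_year <= end and start <= cited_year <= end:
            g.add_edge(citing, cited)
    return g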

do_merge_graphs

def do_merge_graphs(data, g1, g2)

data: the data returned by process_citations

g1: the first graph to consider

g2: the second graph to consider

It returns a new graph being the merge of the two input graphs if these are of the same type (e.g. both DiGraphs). In case the types of the graphs are different, return None.
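
A minimal sketch, assuming networkx graphs as in the previous function:

from networkx import compose

def do_merge_graphs(data, g1, g2):
    # Merge only graphs of the same type (e.g. both DiGraphs)
    if type(g1) is not type(g2):
        return None
    # networkx.compose returns a new graph with the nodes and edges of both
    return compose(g1, g2)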

do_search_by_prefix

def do_search_by_prefix(data, prefix, is_citing)

data: the data returned by process_citations or by other search/filter activities

prefix: a string defining the precise prefix (i.e. the part before the first slash) of a DOI

is_citing: a boolean telling if the operation should be run on citing articles or not

It returns a sub-collection of citations in data where either the citing DOI (if is_citing is True) or the cited DOI (if is_citing is False) is characterised by the input prefix.
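
A possible sketch on the list-of-dictionaries structure:

def do_search_by_prefix(data, prefix, is_citing):
    field = "citing" if is_citing else "cited"
    # The prefix of a DOI is the part before its first slash
    return [row for row in data if row[field].split("/", 1)[0] == prefix]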

do_search

def do_search(data, query, field)

data: the data returned by process_citations or by other search/filter activities

query: a string defining the query to do on the data

field: a string defining the column (it can be either citing, cited, creation, timespan) on which to run the query

It returns a sub-collection of citations in data where the query matches the input field. It is possible to use wildcards in the query. If no wildcards are used, the string in query must completely match the value of the field for that citation to be returned in the results.

Wild cards and operators

Multiple wildcards * can be used in the query. E.g. World*Web looks for all the strings that match the word World, followed by zero or more characters, followed by the word Web (examples: World Wide Web, World Spider Web, etc.).

Boolean operators can be used: and, or, not
<tokens 1> <operator> <tokens 2>

All matches are case insensitive – e.g. specifying World as query will also match strings that contain world
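
A partial sketch of the wildcard matching via regular expressions; it covers a single query without the boolean operators and, or, not, which would require splitting the query first:

import re

def do_search(data, query, field):
    # Each '*' becomes '.*', everything else is escaped literally; the anchors
    # force a complete match when no wildcard is present, and matching is
    # case insensitive
    pattern = ".*".join(re.escape(part) for part in query.split("*"))
    regex = re.compile("^" + pattern + "$", re.IGNORECASE)
    return [row for row in data if regex.match(row[field])]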

do_filter_by_value

def do_filter_by_value(data, query, field)

data: the data returned by process_citations or by other search/filter activities

query: a string defining the query to do on the data

field: a string defining the column (it can be either citing, cited, creation, timespan) on which to run the query

It returns a sub-collection of citations in data where the query matches the input field. No wildcards are permitted in the query, only comparisons.

Comparison and operators

Comparison operators can be used in the query: <, >, <=, >=, ==, !=
<operator> <tokens>

Boolean operators can be used: and, or, not
<tokens 1> <operator> <tokens 2>

All matches are case insensitive – e.g. specifying World as query will also match strings that contain world
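
A partial sketch handling a single "<operator> <tokens>" query (boolean combinations with and, or, not are again left out); values are compared as lowercase strings, which works for YYYY-MM-DD dates, while other fields may need a type-aware comparison:

import operator

COMPARISONS = {"<": operator.lt, ">": operator.gt, "<=": operator.le,
               ">=": operator.ge, "==": operator.eq, "!=": operator.ne}

def do_filter_by_value(data, query, field):
    # Split the query into the comparison operator and the value to compare
    op_symbol, _, value = query.partition(" ")
    compare = COMPARISONS[op_symbol]
    value = value.lower()
    return [row for row in data if compare(row[field].lower(), value)]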

A suggestion

Use test-driven development to understand whether you are doing the job right
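
For instance, a tiny test file could be built around a hand-made CSV fixture whose expected results are known in advance (the module name follows the best_group_ever example, and the DOIs are made up):

import csv
import unittest
from best_group_ever import process_citations, do_get_co_citations

SAMPLE_ROWS = [
    {"citing": "10.1234/a", "cited": "10.1234/x",
     "creation": "2020-01-01", "timespan": "P1Y"},
    {"citing": "10.1234/a", "cited": "10.1234/y",
     "creation": "2020-01-01", "timespan": "P2Y"},
    {"citing": "10.1234/b", "cited": "10.1234/x",
     "creation": "2019-06-01", "timespan": "P0Y6M"},
]

class TestBibliometricEngine(unittest.TestCase):
    def setUp(self):
        # Write the fixture so the tests exercise the real CSV parsing code
        with open("sample.csv", "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(
                f, fieldnames=["citing", "cited", "creation", "timespan"])
            writer.writeheader()
            writer.writerows(SAMPLE_ROWS)
        self.data = process_citations("sample.csv")

    def test_co_citations(self):
        # Only 10.1234/a cites both 10.1234/x and 10.1234/y
        self.assertEqual(
            do_get_co_citations(self.data, "10.1234/x", "10.1234/y"), 1)

if __name__ == "__main__":
    unittest.main()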

Submission

The project, i.e. the implementation of the functions including any additional ancillary functions developed, must be included in a single file named after your group, e.g. best_group_ever.py

The file must be sent by email to me at silvio.peroni@unibo.it

Submission: 2 days before the exam session (the whole group must attend the session) – e.g. send the project on 27 January to discuss it in the session of 29 January

Your project will be compared with the others in terms of efficiency (i.e. time needed for addressing specific tasks)

Evaluation

Maximum score = c-score + e-score + o-score = 16

All projects will run on large CSV files

Correctness of the result: c-score <= 4

Efficiency of the software: e-score <= 4; projects are ranked according to the time spent addressing various tasks, and the e-score is assigned by rank position:

  1. e-score = 4

  2. e-score = 3

  3. e-score = 2

  4. e-score = 1

Oral colloquium: -8 <= o-score <= 8; it is a personal score, each member of the group has their own

A score of at least 2 is needed to pass the exam; otherwise, an additional effort is required (implementing a new function)

END Project