Project

Communication 1

The aperitif will be on the 19th of December, 19:00, at "Kinotto", Via Sebastiano Serlio, 25/2 (Google Maps, Open Street Maps)

Bus: number "21", stop "Dopolavoro Ferroviario" – itinerary, timetable

Scholarly Network Engine (SNE)

The name of the project is Scholarly Network Engine

It is a piece of software that takes as input two files in CSV format

  • one describing the metadata about existing scholarly articles
  • another describing all the citations that exist between such articles

The goal of the software is to run particular analyses on the various networks generated from all this information

You need a group

The project must be implemented by a group of people

You need to

  • form groups of at least 3 and at most 4 people

  • choose a name for the group (yes: a name) - it will be used to publish the ranks of the best-performing projects (more info later)

  • communicate the name of the group and its members (including their emails) to me by sending an email to silvio.peroni@unibo.it

All groups must be ready by next Wednesday at the latest (19 December)

Project stub

from <group_name> import *

class ScholarlyNetworkEngine(object):
    def __init__(self, metadata_file_path, citation_file_path):
        self.sse = ScholarlySearchEngine(metadata_file_path)
        self.data = process_citation_data(citation_file_path)

    def citation_graph(self):
        return do_citation_graph(self.data, self.sse)

    def coupling(self, doi_1, doi_2):
        return do_coupling(self.data, self.sse, doi_1, doi_2)

    def aut_coupling(self, aut_1, aut_2):
        return do_aut_coupling(self.data, self.sse, aut_1, aut_2)

    def aut_distance(self, aut):
        return do_aut_distance(self.data, self.sse, aut)

    def find_cycles(self):
        return do_find_cycles(self.data, self.sse)

    def cit_count_year(self, aut, year):
        return do_cit_count_year(self.data, self.sse, aut, year)

Your output

You have to develop a Python file named after your group, where spaces are replaced by underscores and the whole file name is lowercase

E.g.: group Best group ever, file best_group_ever.py
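The naming rule above can be sketched in a couple of lines. The helper name module_file_name is hypothetical and is not part of the project stub; it only illustrates the rule.

```python
def module_file_name(group_name):
    """Lowercase the group name and replace spaces with underscores."""
    return group_name.lower().replace(" ", "_") + ".py"


print(module_file_name("Best group ever"))  # best_group_ever.py
```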

The import statement must specify the module implemented

E.g.: from best_group_ever import *

Each group has to implement the seven functions that have been highlighted in red in the previous slide

Reusing the methods of the Scholarly Search Engine is possible where needed (and recommended)

Reuse of last year's project

A new instance of the Scholarly Search Engine is created every time a new Scholarly Network Engine is instantiated

The Scholarly Search Engine was last year's project, and its main class makes available several methods for analysing and querying the table of metadata of the scholarly documents passed as input

Suggestion: reuse its methods when possible

process_citation_data

def process_citation_data(file_path)

It takes as input a comma-separated values (CSV) file, and returns a data structure containing all the data included in the CSV in some form

The data can be preprocessed, changed according to some empirical rule, ordered in a certain way, etc.

These data will be automatically provided as input to the other functions

Example of CSV citation data

doi                  | cited by | known refs
10.7717/peerj-cs.147 | 2        | 10.7717/peerj-cs.1; 10.7717/peerj-cs.86
...                  | ...      | ...
...                  | ...      | ...

doi (string): the identifier of an article

cited by (integer): the number of other articles that are citing the one identified by doi, overall

known refs (list of strings: doi 1; doi 2; ...): the DOIs of the articles managed by the Scholarly Search Engine that are cited by doi
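A minimal sketch of process_citation_data follows, assuming the CSV has a header row with exactly the three column names described above. The helper parse_citation_rows, the dictionary layout and the key names cited_by and known_refs are illustrative assumptions; any data structure is acceptable as long as the other functions can use it.

```python
import csv


def parse_citation_rows(lines):
    """Turn CSV lines (header included) into a dict keyed by DOI.

    The layout {doi: {"cited_by": int, "known_refs": [doi, ...]}} is only
    one possible choice, not a layout required by the project.
    """
    data = {}
    for row in csv.DictReader(lines):
        refs = [r.strip() for r in row["known refs"].split(";") if r.strip()]
        data[row["doi"]] = {
            "cited_by": int(row["cited by"]),
            "known_refs": refs,
        }
    return data


def process_citation_data(file_path):
    """Open the CSV file and delegate the parsing to the helper above."""
    with open(file_path, newline="", encoding="utf-8") as csv_file:
        return parse_citation_rows(csv_file)
```

Splitting the parsing from the file handling keeps the parsing step easy to test with a plain list of strings.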

do_citation_graph

def do_citation_graph(data, sse)

data: the data returned by process_citation_data

sse: an instance of the Scholarly Search Engine

It returns a directed graph (hint: use NetworkX) describing the citation network between the articles managed by sse. The particular identifier of each node in the graph should be a string representing the article (see the pretty_print method of sse)

The articles that are not involved in any citation are not included in the graph
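One way to build such a graph with NetworkX is sketched below. It makes several assumptions: known_refs is a hypothetical dictionary mapping each DOI to the list of DOIs it cites, labels is a hypothetical dictionary mapping a DOI to the string used as node identifier (a stand-in for whatever pretty_print returns), and edges point from the citing article to the cited one, a direction the description above does not fix.

```python
import networkx as nx


def build_citation_graph(known_refs, labels):
    """Directed graph with one edge citing -> cited per known reference."""
    graph = nx.DiGraph()
    for citing, cited_list in known_refs.items():
        for cited in cited_list:
            # Nodes are created only through edges, so articles that are
            # not involved in any citation never enter the graph.
            graph.add_edge(labels.get(citing, citing), labels.get(cited, cited))
    return graph


# Made-up identifiers, for illustration only
example = build_citation_graph({"a": ["b"], "b": [], "c": []}, {"a": "Article A"})
print(list(example.edges()))
```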

do_coupling

def do_coupling(data, sse, doi_1, doi_2)

data: the data returned by process_citation_data

sse: an instance of the Scholarly Search Engine

doi_1: a DOI identifying an article

doi_2: a DOI identifying another article

It returns a non-negative integer indicating the coupling strength of the two given documents, i.e. how many identical documents they both cite
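The coupling strength reduces to the size of a set intersection. The sketch below assumes a hypothetical dictionary known_refs that maps each DOI to the list of DOIs it cites; the function name and the DOIs are made up for illustration.

```python
def coupling_strength(known_refs, doi_1, doi_2):
    """Number of documents cited by both doi_1 and doi_2."""
    refs_1 = set(known_refs.get(doi_1, []))
    refs_2 = set(known_refs.get(doi_2, []))
    return len(refs_1 & refs_2)


refs = {
    "10.1000/a": ["10.1000/x", "10.1000/y"],
    "10.1000/b": ["10.1000/y", "10.1000/z"],
}
print(coupling_strength(refs, "10.1000/a", "10.1000/b"))  # 1
```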

do_aut_coupling

def do_aut_coupling(data, sse, aut_1, aut_2)

data: the data returned by process_citation_data

sse: an instance of the Scholarly Search Engine

aut_1: an author's full name

aut_2: another author's full name

It returns a non-negative integer indicating the coupling strength of two authors, i.e. how many identical documents are cited by their unique bodies of work (i.e. all the documents they wrote except the ones they coauthored)
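A possible reading of this definition is sketched below. It assumes that "coauthored" means written by the two given authors together, that known_refs is a hypothetical dictionary from a DOI to the DOIs it cites, and that papers_by_author is a hypothetical dictionary from an author's full name to the DOIs of their papers (in the real project this information would come from the Scholarly Search Engine).

```python
def author_coupling_strength(known_refs, papers_by_author, aut_1, aut_2):
    """Documents cited by both authors' papers, joint papers excluded."""
    papers_1 = set(papers_by_author.get(aut_1, ()))
    papers_2 = set(papers_by_author.get(aut_2, ()))
    coauthored = papers_1 & papers_2  # papers the two authors share

    def cited_by(papers):
        # Union of everything cited by the given papers
        return {ref for doi in papers for ref in known_refs.get(doi, [])}

    cited_1 = cited_by(papers_1 - coauthored)
    cited_2 = cited_by(papers_2 - coauthored)
    return len(cited_1 & cited_2)
```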

do_aut_distance

def do_aut_distance(data, sse, aut)

data: the data returned by process_citation_data

sse: an instance of the Scholarly Search Engine

aut: an author's full name

It returns the graph of all the authors (nodes) reachable from aut (included) following only coauthorship relations (edges). Each edge has the attribute co_authored_papers (i.e. the number of papers coauthored by the linked authors), and each node has the attribute distance, i.e. the minimum number of edges needed to reach aut
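The sketch below shows one way to obtain such a graph with NetworkX, assuming a hypothetical dictionary authors_by_paper that maps each paper to the list of its authors' full names. It first builds the whole coauthorship graph, then keeps only the part reachable from aut and stores the breadth-first distance on each node.

```python
from itertools import combinations

import networkx as nx


def coauthor_distance_graph(authors_by_paper, aut):
    """Coauthorship graph restricted to the authors reachable from aut."""
    full = nx.Graph()
    full.add_node(aut)  # aut is included even without coauthors
    for authors in authors_by_paper.values():
        for a, b in combinations(sorted(set(authors)), 2):
            if full.has_edge(a, b):
                full[a][b]["co_authored_papers"] += 1
            else:
                full.add_edge(a, b, co_authored_papers=1)
    # Minimum number of edges from aut to every reachable author
    distances = nx.single_source_shortest_path_length(full, aut)
    reachable = full.subgraph(list(distances)).copy()
    nx.set_node_attributes(reachable, values=distances, name="distance")
    return reachable
```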

do_find_cycles

def do_find_cycles(data, sse)

data: the data returned by process_citation_data

sse: an instance of the Scholarly Search Engine

It returns a list of tuples, where each tuple contains the sequence of nodes that form a cycle in the citation network, i.e. a path that, starting from a node, allows one to go back to that node by following the available citation links
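If the citation network is already available as a NetworkX directed graph, the simple_cycles function can enumerate its elementary cycles. The sketch below is one possible approach, not necessarily the fastest one on large graphs; the function name and node names are made up.

```python
import networkx as nx


def find_citation_cycles(graph):
    """List the elementary cycles of a directed graph as tuples of nodes."""
    return [tuple(cycle) for cycle in nx.simple_cycles(graph)]


g = nx.DiGraph()
g.add_edges_from([("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")])
print(find_citation_cycles(g))  # one cycle through a, b and c
```

The node at which each returned cycle starts depends on the library's traversal order, so callers should not rely on it.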

do_cit_count_year

def do_cit_count_year(data, sse, aut, year)

data: the data returned by process_citation_data

sse: an instance of the Scholarly Search Engine

aut: an author's full name

year: an integer representing a year

It returns a key:value dictionary of integers: key represents a year, while value is the sum of all the citations received by the papers authored by aut in that particular year

do_cit_count_year (addition)

If the parameter year is not None, the keys less than year are excluded from the dictionary

If the parameter year is None, the earliest publication year of the articles authored by aut is considered as the starting year

If aut does not have any citations in a particular year, then the dictionary must specify it anyway, with 0 citations
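The three rules above only concern which years appear in the dictionary, so they can be sketched separately from the counting itself. In the sketch below raw_counts (year to number of citations, already computed), first_pub_year and last_year are hypothetical inputs; in particular, the slides do not say where the range of years ends, so last_year is an assumption, as is the choice of never starting before the first publication year.

```python
def fill_citation_years(raw_counts, first_pub_year, last_year, year=None):
    """Apply the year rules to an already computed year -> citations dict."""
    if year is None:
        start = first_pub_year
    else:
        # Keys less than year are excluded; starting no earlier than the
        # first publication year is an assumption of this sketch.
        start = max(year, first_pub_year)
    # Years without citations are reported explicitly with 0
    return {y: raw_counts.get(y, 0) for y in range(start, last_year + 1)}


counts = {2015: 3, 2017: 1}
print(fill_citation_years(counts, 2014, 2017))
# {2014: 0, 2015: 3, 2016: 0, 2017: 1}
print(fill_citation_years(counts, 2014, 2017, year=2016))
# {2016: 0, 2017: 1}
```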

A suggestion

Use test-driven development to check that your code does the right job
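As a small illustration of the idea, the tests below could be written before the function they exercise, and the function then implemented until they pass. The helper split_known_refs is hypothetical (it is not one of the seven required functions) and the DOIs are made up.

```python
import unittest


def split_known_refs(cell):
    """Split a 'known refs' cell ('doi 1; doi 2; ...') into a list of DOIs."""
    return [doi.strip() for doi in cell.split(";") if doi.strip()]


class SplitKnownRefsTest(unittest.TestCase):
    def test_two_dois(self):
        self.assertEqual(
            split_known_refs("10.1000/a; 10.1000/b"),
            ["10.1000/a", "10.1000/b"],
        )

    def test_empty_cell(self):
        self.assertEqual(split_known_refs(""), [])
```

Saved in a file, these tests can be run with python -m unittest followed by the file name.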

Submission

The project, i.e. the implementation of the seven functions including any additional ancillary functions developed, must be included in a single file named after your group, e.g. best_group_ever.py

The file must be sent by email to me at silvio.peroni@unibo.it

Submission: 2 days before the exam session (the whole group must attend the session), e.g. send the project on the 23rd of January to discuss it in the session of the 25th of January

Your project will be compared with the others in terms of efficiency (i.e. time needed for addressing specific tasks)
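To get a rough idea of how long a function takes before submitting, a call can be timed with the standard library. This is only a sketch: the helper time_call is hypothetical, and the tasks and measurement method actually used for the ranking are not described here.

```python
import time


def time_call(func, *args, **kwargs):
    """Return (result, elapsed_seconds) for a single call of func."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Example with a built-in function standing in for a project function
result, seconds = time_call(sorted, range(100000, 0, -1))
print(f"sorted {len(result)} items in {seconds:.4f} s")
```

A single measurement is noisy, so repeating the call a few times and comparing the results gives a more reliable picture.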

Evaluation

Maximum score = c-score + e-score + o-score = 16

All projects will run on large CSV files

Correctness of the result: c-score <= 4

Efficiency of the software: e-score <= 4; projects are ranked by the time they take to complete various tasks, and the rank determines the e-score

  1. e-score = 4

  2. e-score = 3

  3. e-score = 2

Oral colloquium: -8 <= o-score <= 8; it is a personal score, each member of the group has their own

At least 2 is needed to pass the exam, otherwise an additional effort is required (new function)
