The aperitif will be on the 19th of December, 19:00, at "Kinotto", Via Sebastiano Serlio, 25/2 (Google Maps, Open Street Maps)
Bus: number "21", stop "Dopolavoro Ferroviario" – itinerary, timetable
The name of the project is Scholarly Network Engine
It is a piece of software that takes two files in CSV format as input
The goal of the software is to run specific analyses on the various networks generated from this information
The project must be implemented by a group of people
You need to
form groups of at least 3 and at most 4 people
choose a name for the group (yes: a name) - it will be used to publish the ranks of the best-performing projects (more info later)
communicate the name of the group and its members (including their emails) to me by sending an email to silvio.peroni@unibo.it
All groups must be ready by next Wednesday at the latest (19 December)
from <group_name> import *

class ScholarlyNetworkEngine(object):
    def __init__(self, metadata_file_path, citation_file_path):
        self.sse = ScholarlySearchEngine(metadata_file_path)
        self.data = process_citation_data(citation_file_path)

    def citation_graph(self):
        return do_citation_graph(self.data, self.sse)

    def coupling(self, doi_1, doi_2):
        return do_coupling(self.data, self.sse, doi_1, doi_2)

    def aut_coupling(self, aut_1, aut_2):
        return do_aut_coupling(self.data, self.sse, aut_1, aut_2)

    def aut_distance(self, aut):
        return do_aut_distance(self.data, self.sse, aut)

    def find_cycles(self):
        return do_find_cycles(self.data, self.sse)

    def cit_count_year(self, aut, year):
        return do_cit_count_year(self.data, self.sse, aut, year)
You have to develop a Python file named after your group, with spaces replaced by underscores and the whole file name in lowercase
E.g.: group Best group ever, file best_group_ever.py
The import statement must specify the module implemented
E.g.: from best_group_ever import *
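Each group has to implement the seven functions highlighted in red in the previous slide: process_citation_data, do_citation_graph, do_coupling, do_aut_coupling, do_aut_distance, do_find_cycles and do_cit_count_year
You may reuse any of the methods of the Scholarly Search Engine if needed (this is recommended)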
A new instance of the Scholarly Search Engine is created every time a new Scholarly Network Engine is instantiated
The Scholarly Search Engine was last year's project, and its main class makes available several methods for analysing and querying the metadata table of the scholarly documents passed as input
Suggestion: reuse its methods when possible
def process_citation_data(file_path)
It takes a comma-separated values (CSV) file as input and returns a data structure containing all the data included in the CSV in some form
The data can be preprocessed, changed according to some empirical rule, ordered in a certain way, etc.
This data structure will automatically be passed as input to the other functions
doi | cited by | known refs |
---|---|---|
10.7717/peerj-cs.147 | 2 | 10.7717/peerj-cs.1; 10.7717/peerj-cs.86 |
... | ... | ... |
... | ... | ... |
doi (string): the identifier of an article
cited by (integer): the overall number of other articles that are citing the one identified by doi
known refs (list of strings: doi 1; doi 2; ...): the DOIs of the articles managed by the Scholarly Search Engine that are cited by doi
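A minimal sketch of how such a file could be parsed, assuming the header names are exactly `doi`, `cited by` and `known refs`, that fields are comma-separated, and that references are separated by semicolons. The function name `process_citation_data_sketch` and the dictionary layout are illustrative choices, not a required format. The sketch reads from any iterable of lines; with a real path one would pass `open(file_path, newline="", encoding="utf-8")` instead.

```python
import csv
import io


def process_citation_data_sketch(source):
    """Parse citation rows into {doi: {"cited_by": int, "known_refs": [doi, ...]}}.

    `source` is any iterable of CSV lines (an open file works too).
    """
    data = {}
    for row in csv.DictReader(source):
        # Split "doi 1; doi 2; ..." and drop empty entries (rows without references).
        refs = [r.strip() for r in row["known refs"].split(";") if r.strip()]
        data[row["doi"].strip()] = {
            "cited_by": int(row["cited by"]),
            "known_refs": refs,
        }
    return data


# Invented two-row sample; the first row mirrors the example in the table above.
sample = io.StringIO(
    "doi,cited by,known refs\n"
    "10.7717/peerj-cs.147,2,10.7717/peerj-cs.1; 10.7717/peerj-cs.86\n"
    "10.7717/peerj-cs.1,5,\n"
)
parsed = process_citation_data_sketch(sample)
print(parsed["10.7717/peerj-cs.147"]["known_refs"])
```

Indexing the rows by DOI makes the lookups needed by the other functions constant-time, which may matter since projects are ranked by speed.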
def do_citation_graph(data, sse)
data: the data returned by process_citation_data
sse: an instance of the Scholarly Search Engine
It returns a directed graph (hint: use NetworkX) describing the citation network between the articles managed by sse. The identifier of each node in the graph should be a string representing the article (see the pretty_print method of sse)
The articles that are not involved in any citation are not included in the graph
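The edge-building step can be sketched with plain Python, assuming `data` is a dictionary of the form `{doi: {"known_refs": [...]}}` (one possible output of process_citation_data, not a prescribed one). The `label` parameter is a hypothetical stand-in for whatever string the pretty_print method of sse produces; here it defaults to the DOI itself.

```python
def citation_edges(data, label=lambda doi: doi):
    """Return the (citing, cited) pairs of the citation network."""
    edges = []
    for doi, record in data.items():
        for ref in record["known_refs"]:
            edges.append((label(doi), label(ref)))
    return edges


# Invented sample: "10.1/c" neither cites nor is cited by anything.
data = {
    "10.1/a": {"cited_by": 1, "known_refs": ["10.1/b"]},
    "10.1/b": {"cited_by": 2, "known_refs": []},
    "10.1/c": {"cited_by": 0, "known_refs": []},
}
edges = citation_edges(data)
# Only nodes that appear in at least one edge end up in the graph.
nodes = {node for edge in edges for node in edge}
print(edges, sorted(nodes))
```

Building the graph from edges only (for instance with the `add_edges_from` method of a NetworkX `DiGraph`) leaves out articles that are not involved in any citation, since no node is ever added for them.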
def do_coupling(data, sse, doi_1, doi_2)
data: the data returned by process_citation_data
sse: an instance of the Scholarly Search Engine
doi_1: a DOI identifying an article
doi_2: a DOI identifying another article
It returns a non-negative integer which indicates the coupling strength of the two given documents, i.e. how many identical documents they cite
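Coupling strength amounts to the size of a set intersection. The sketch below assumes the same hypothetical `{doi: {"known_refs": [...]}}` layout for `data` and ignores sse; note that only references listed under known refs can be counted.

```python
def coupling_strength(data, doi_1, doi_2):
    """Count the documents cited by both doi_1 and doi_2."""
    refs_1 = set(data.get(doi_1, {}).get("known_refs", []))
    refs_2 = set(data.get(doi_2, {}).get("known_refs", []))
    return len(refs_1 & refs_2)


# Invented sample: the two articles share exactly one reference ("10.1/y").
coupling_data = {
    "10.1/a": {"known_refs": ["10.1/x", "10.1/y"]},
    "10.1/b": {"known_refs": ["10.1/y", "10.1/z"]},
}
print(coupling_strength(coupling_data, "10.1/a", "10.1/b"))
```

A DOI missing from `data` is treated as having no known references, so the result is 0 rather than an error; whether that is the desired behaviour is a design decision for each group.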
def do_aut_coupling(data, sse, aut_1, aut_2)
data: the data returned by process_citation_data
sse: an instance of the Scholarly Search Engine
aut_1: an author's full name
aut_2: another author's full name
It returns a non-negative integer which indicates the coupling strength of two authors, i.e. how many identical documents have been cited by their unique bodies of work (i.e. all the documents they wrote except those they coauthored)
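One possible reading of this definition, sketched under assumptions: the papers the two authors wrote together are removed from both bodies of work, and the remaining papers' references are intersected. The `papers_by_author` dictionary is a hypothetical stand-in for a query on sse, whose actual methods are not shown here.

```python
def author_coupling_strength(data, papers_by_author, aut_1, aut_2):
    """Count documents cited by both authors' non-shared papers."""
    papers_1 = set(papers_by_author.get(aut_1, []))
    papers_2 = set(papers_by_author.get(aut_2, []))
    shared = papers_1 & papers_2  # papers the two authors wrote together

    def cited(papers):
        # Union of the known references of a set of papers.
        return {
            ref
            for doi in papers
            for ref in data.get(doi, {}).get("known_refs", [])
        }

    return len(cited(papers_1 - shared) & cited(papers_2 - shared))


# Invented sample: p3 is coauthored by both authors, so it is ignored.
aut_data = {
    "p1": {"known_refs": ["x", "y"]},
    "p2": {"known_refs": ["y", "z"]},
    "p3": {"known_refs": ["x", "y", "z"]},
}
papers_by_author = {"Ann": ["p1", "p3"], "Bob": ["p2", "p3"]}
print(author_coupling_strength(aut_data, papers_by_author, "Ann", "Bob"))
```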
def do_aut_distance(data, sse, aut)
data: the data returned by process_citation_data
sse: an instance of the Scholarly Search Engine
aut: an author's full name
It returns the graph of all the authors (nodes) reachable from aut (included) following only coauthorship relations (edges). Each edge has the attribute co_authored_papers (i.e. the number of papers coauthored by the linked authors), and each node has the attribute distance, i.e. the minimum number of edges needed to reach aut
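The two attributes can be computed with a pair count followed by a breadth-first search, which visits authors in order of increasing distance. This is a sketch only: `authors_by_paper` is a hypothetical mapping standing in for what sse would provide, and the result is returned as two dictionaries rather than as a graph object.

```python
from collections import deque
from itertools import combinations


def coauthor_network(authors_by_paper, aut):
    """Return (distance per reachable author, coauthored-paper count per edge)."""
    weights = {}
    neighbours = {}
    for authors in authors_by_paper.values():
        # Each unordered pair of coauthors of a paper is one coauthorship.
        for a, b in combinations(sorted(set(authors)), 2):
            weights[(a, b)] = weights.get((a, b), 0) + 1
            neighbours.setdefault(a, set()).add(b)
            neighbours.setdefault(b, set()).add(a)

    # Breadth-first search: the first visit to a node uses a shortest path.
    distance = {aut: 0}
    queue = deque([aut])
    while queue:
        current = queue.popleft()
        for other in neighbours.get(current, ()):
            if other not in distance:
                distance[other] = distance[current] + 1
                queue.append(other)

    # Keep only edges inside the component reachable from aut.
    edges = {pair: n for pair, n in weights.items() if pair[0] in distance}
    return distance, edges


# Invented sample: Dan and Eve are not reachable from Ann.
authors_by_paper = {
    "p1": ["Ann", "Bob"],
    "p2": ["Bob", "Cy"],
    "p3": ["Ann", "Bob"],
    "p4": ["Dan", "Eve"],
}
distance, coauthor_edges = coauthor_network(authors_by_paper, "Ann")
print(distance, coauthor_edges)
```

With NetworkX, these values could then be attached as attributes, e.g. `G.add_node(name, distance=d)` and `G.add_edge(a, b, co_authored_papers=n)`.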
def do_find_cycles(data, sse)
data: the data returned by process_citation_data
sse: an instance of the Scholarly Search Engine
It returns a list of tuples, where each tuple contains the sequence of nodes that form a cycle in the citation network, i.e. a path that, starting from a node, allows one to go back to that node by following the available citation links
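To illustrate what a cycle is here, the following sketch enumerates simple cycles with a depth-first search over a plain successor dictionary (an assumed representation, with node names that can be compared with `<`). Each cycle is reported once, starting from its smallest node. It is a naive approach meant for small inputs; NetworkX offers `simple_cycles` for directed graphs, which is likely a better fit for large files.

```python
def simple_cycles(successors):
    """Return each simple cycle once, as a tuple starting at its smallest node."""
    cycles = []

    def extend(start, node, path, on_path):
        for nxt in successors.get(node, ()):
            if nxt == start:
                # Back at the starting node: the current path is a cycle.
                cycles.append(tuple(path))
            elif nxt > start and nxt not in on_path:
                # Only visit nodes larger than start to avoid duplicates.
                on_path.add(nxt)
                path.append(nxt)
                extend(start, nxt, path, on_path)
                path.pop()
                on_path.remove(nxt)

    for start in sorted(successors):
        extend(start, start, [start], {start})
    return cycles


# Invented sample: a cites b, b cites c and a, c cites a.
successors = {"a": ["b"], "b": ["c", "a"], "c": ["a"], "d": []}
found = simple_cycles(successors)
print(found)
```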
def do_cit_count_year(data, sse, aut, year)
data: the data returned by process_citation_data
sse: an instance of the Scholarly Search Engine
aut: an author's full name
year: an integer representing a year
It returns a key:value dictionary of integers: key represents a year, while value is the sum of all the citations received by the papers authored by aut in that particular year
If the parameter year is not None, the keys lower than year are excluded from the dictionary
If the parameter year is None, the earliest publication year of the articles authored by aut is considered as the starting year
If aut does not have any citations in a particular year, then the dictionary must specify it anyway, with 0 citations
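A sketch of the zero-filling and starting-year rules, under one possible reading of the definition: the value for a year is the sum of the cited by counts of the papers that aut published in that year. The last key is assumed to be the latest publication year of aut's papers, since the end of the range is not stated above. `papers_by_author` and `year_of` are hypothetical stand-ins for queries on sse.

```python
def citations_per_year(data, papers_by_author, year_of, aut, year=None):
    """Map each year to the citations of the papers aut published that year."""
    papers = papers_by_author.get(aut, [])
    if not papers:
        return {}
    years = [year_of[doi] for doi in papers]
    # Start from the given year, or from the earliest publication year.
    start = min(years) if year is None else year
    # Pre-fill every year in the range so that empty years show 0.
    counts = {y: 0 for y in range(start, max(years) + 1)}
    for doi in papers:
        if year_of[doi] >= start:
            counts[year_of[doi]] += data.get(doi, {}).get("cited_by", 0)
    return counts


# Invented sample: two papers from 2015, one from 2018, nothing in between.
cit_data = {"p1": {"cited_by": 3}, "p2": {"cited_by": 4}, "p3": {"cited_by": 1}}
year_of = {"p1": 2015, "p2": 2015, "p3": 2018}
cit_papers = {"Ann": ["p1", "p2", "p3"]}
print(citations_per_year(cit_data, cit_papers, year_of, "Ann"))
print(citations_per_year(cit_data, cit_papers, year_of, "Ann", 2017))
```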
Use test-driven development to check that your functions behave as expected
The project, i.e. the implementation of the seven functions including any additional ancillary function developed, must be included in a single file named after your group, e.g. best_group_ever.py
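As an illustration of the test-driven approach, the tests below fix the expected results for do_coupling on a tiny, invented dataset before the function body is trusted. The data layout and the stand-in implementation are assumptions for the sake of the example; sse is passed as None because this sketch does not use it.

```python
import unittest


def do_coupling(data, sse, doi_1, doi_2):
    """Stand-in implementation, written to satisfy the tests below."""
    refs_1 = set(data.get(doi_1, {}).get("known_refs", []))
    refs_2 = set(data.get(doi_2, {}).get("known_refs", []))
    return len(refs_1 & refs_2)


class CouplingTest(unittest.TestCase):
    # Small hand-made dataset whose expected results are known in advance.
    DATA = {
        "a": {"known_refs": ["x", "y"]},
        "b": {"known_refs": ["y", "z"]},
    }

    def test_one_shared_reference(self):
        self.assertEqual(do_coupling(self.DATA, None, "a", "b"), 1)

    def test_unknown_doi_has_no_coupling(self):
        self.assertEqual(do_coupling(self.DATA, None, "a", "missing"), 0)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(CouplingTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Rerunning such tests after every change helps catch regressions, including those introduced while optimising for speed.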
The file must be sent by email to me at silvio.peroni@unibo.it
Submission: 2 days before the exam session (the whole group must attend the session), e.g. send the project on 23 January to discuss it in the session of 25 January
Your project will be compared with the others in terms of efficiency (i.e. time needed for addressing specific tasks)
Maximum score = c-score + e-score + o-score = 16
All projects will run on large CSV files
Correctness of the result: c-score <= 4
Efficiency of the software: e-score <= 4; projects are ranked according to the time spent on various tasks
e-score = 4, 3 or 2, depending on the rank
Oral colloquium: -8 <= o-score <= 8; it is a personal score, so each member of the group has their own
At least 2 is needed to pass the exam; otherwise, an additional effort is required (a new function)