The aperitif will be on the 16th of December, 19:00, at "Kinotto", Via Sebastiano Serlio, 25/2 (Google Maps, Open Street Maps)
Bus: number "21", stop "Dopolavoro Ferroviario" – itinerary, timetable
The name of the project is Meta Analyser Tool
It is a piece of software that takes as input a file in a particular format (CSV) describing metadata about existing scholarly articles.
The goal of the software is to run specific analyses on such data.
The project must be implemented by a group of people
You need to
form groups of at least 3 and at most 4 people
choose a name for the group (yes: a name) - it will be used to publish the ranks of the best-performing projects (more info later)
communicate the name of the group and its members (including their emails) to me by sending an email to silvio.peroni@unibo.it
All groups must be ready by next Monday (16 December) at the latest
from <group_name> import *

class MetaAnalyserTool(object):
    def __init__(self, metadata_file_path):
        self.data = process_metadata(metadata_file_path)

    def get_ids(self, str_value, field_set):
        return do_get_ids(self.data, str_value, field_set)

    def get_by_id(self, id, field_set):
        return do_get_by_id(self.data, id, field_set)

    # The parameter matches the one expected by do_filter
    def filter(self, field_value_list):
        return do_filter(self.data, field_value_list)

    def coauthor_graph(self, author_id, level):
        return do_coauthor_graph(self.data, author_id, level)

    def author_network(self):
        return do_author_network(self.data)

    def retrieve_tree_of_venues(self, no_ids):
        return do_retrieve_tree_of_venues(self.data, no_ids)
You have to develop a Python file named after your group, with spaces replaced by underscores and the whole name in lowercase
E.g.: group Best group ever, file best_group_ever.py
The import statement must specify the module implemented
E.g.: from best_group_ever import *
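The naming rule above can be sketched as a tiny helper. The function name module_file_name is hypothetical and only illustrates the rule; it is not part of the required interface.

```python
def module_file_name(group_name):
    """Lowercase the group name and replace spaces with underscores."""
    return group_name.strip().lower().replace(" ", "_") + ".py"


print(module_file_name("Best group ever"))  # best_group_ever.py
```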
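Each group has to implement the seven functions called by the methods of the class shown on the previous slide: process_metadata, do_get_ids, do_get_by_id, do_filter, do_coauthor_graph, do_author_network, and do_retrieve_tree_of_venues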
def process_metadata(metadata_file_path)
It takes as input a comma-separated values (CSV) file, and returns a data structure containing all the data included in the CSV in some form
The data can be preprocessed, changed according to some empirical rule, ordered in a certain way, etc.
These data will be automatically provided in input to the other functions
id | title | author | pub_date | venue | volume | issue | page | type | publisher |
---|---|---|---|---|---|---|---|---|---|
id:1; b:1 | The SPAR Ontologies | Peroni, Silvio [a:1; viaf:309649450]; Shotton, David [a:2] | 2018 | ISWC 2018 [b:2; b:5] | | | 119-136 | book chapter | Springer [p:1] |
b:3 | Automating semantic publishing | Peroni, Silvio [a:1] | 2017 | Data Science [b:4] | 1 | 1-2 | 155-173 | journal article | IOS Press [p:2] |
id (list of strings, separator ";"): the identifiers of an article
author (list of strings, separator ";"): the authors of an article
[id1; id2; etc.] (list of strings, separator ";"): each string in the fields author, venue and publisher can include one or more identifiers for that entity
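Since ";" separates both the entities in a cell and the identifiers inside the square brackets, a plain split on ";" would cut identifier lists in half. The following is a minimal sketch of one way to parse such a cell. The function parse_entities and the two regular expressions are illustrative assumptions, not part of the required interface, and the sketch assumes brackets are never nested.

```python
import re

# Split on ";" only when it is not inside a "[...]" identifier list
# (assumption: brackets are well formed and never nested).
TOP_LEVEL_SEMICOLON = re.compile(r";(?![^\[]*\])")
# A name optionally followed by a bracketed list of identifiers.
NAME_AND_IDS = re.compile(r"(.*?)\s*(?:\[(.*)\])?")


def parse_entities(cell):
    """Return a list of (name, [identifiers]) pairs found in one CSV cell."""
    entities = []
    for part in TOP_LEVEL_SEMICOLON.split(cell):
        part = part.strip()
        if not part:
            continue
        match = NAME_AND_IDS.fullmatch(part)
        name, raw_ids = match.group(1), match.group(2)
        ids = [i.strip() for i in raw_ids.split(";") if i.strip()] if raw_ids else []
        entities.append((name, ids))
    return entities


cell = "Peroni, Silvio [a:1; viaf:309649450]; Shotton, David [a:2]"
for name, ids in parse_entities(cell):
    print(name, ids)
```

Doing this parsing once inside process_metadata means the other functions can work on names and identifiers directly.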
def do_get_ids(data, str_value, field_set)
data: the data returned by process_metadata
str_value: a string value to search
field_set: the set of the field names to consider
It returns a set of identifiers for any item that matches str_value. If the field_set parameter is None, all the fields are searched. Otherwise, only the fields listed in field_set are used
def do_get_by_id(data, id, field_set)
data: the data returned by process_metadata
id: an identifier of an item
field_set: the set of the field names to consider
It returns a set of strings of the items associated with the identifier id. If the field_set parameter is None, all the fields are searched for the identifier. Otherwise, only the fields listed in field_set are used.
def do_filter(data, field_value_list)
data: the data returned by process_metadata
field_value_list: a list of tuples of two values each
It returns a list with the textual representation of each row of the CSV table that complies with the filters specified in field_value_list. Each tuple of the field_value_list parameter contains two values: the first one is the field on which to apply the filter, the second is the value used for filtering. E.g. ("author", "Peron*") will take into consideration only the rows that, in the field author, have at least one author whose name starts with "Peron". If field_value_list is None, the function considers all the rows in data
Textual representation: <Family Name 1> <Given Name Initials 1>, [...], <Family Name N> <Given Name Initials N>. (<pub_date>). <title>. <venue>
E.g.: Peroni S, Shotton D. (2018). The SPAR Ontologies. ISWC 2018
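As an illustration, the textual representation could be built as follows. This is only a sketch: the helper names format_author and textual_representation are hypothetical, and it assumes that author names arrive in the "Family, Given" form with identifiers already removed.

```python
def format_author(author):
    """Turn 'Peroni, Silvio' into 'Peroni S' (family name plus initials)."""
    family, _, given = author.partition(",")
    initials = "".join(word[0].upper() for word in given.split())
    return f"{family.strip()} {initials}".strip()


def textual_representation(authors, pub_date, title, venue):
    """Build '<authors>. (<pub_date>). <title>. <venue>' for one row."""
    names = ", ".join(format_author(a) for a in authors)
    return f"{names}. ({pub_date}). {title}. {venue}"


print(textual_representation(
    ["Peroni, Silvio", "Shotton, David"], "2018", "The SPAR Ontologies", "ISWC 2018"
))
# Peroni S, Shotton D. (2018). The SPAR Ontologies. ISWC 2018
```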
Multiple wildcards * can be used in str_value, id, and in the second values of the tuples in field_value_list of the previous functions. E.g. World*Web looks for all the strings that match the word World followed by zero or more characters, followed by the word Web (examples: World Wide Web, World Spider Web, etc.)
All matches are case insensitive, e.g. specifying World as str_value will also match strings that contain world
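One possible way to implement the wildcard and case-insensitivity rules is to translate each pattern into a regular expression. The sketch below is an assumption about how to do it, not a prescribed solution; the names wildcard_to_regex and matches, and the whole flag, are hypothetical. By default it checks whether the text contains the pattern, and whole=True requires the pattern to cover the entire string (useful for "starts with" filters such as Peron*).

```python
import re


def wildcard_to_regex(pattern):
    """Turn a pattern with '*' wildcards into a case-insensitive regex."""
    # Escape every literal chunk so that characters like "." or "[" stay literal.
    escaped_parts = [re.escape(part) for part in pattern.split("*")]
    return re.compile(".*".join(escaped_parts), re.IGNORECASE)


def matches(pattern, text, whole=False):
    """Check whether text contains (or, if whole, entirely matches) the pattern."""
    regex = wildcard_to_regex(pattern)
    if whole:
        return regex.fullmatch(text) is not None
    return regex.search(text) is not None


print(matches("World*Web", "The World Wide Web"))        # True
print(matches("World", "hello world"))                   # True
print(matches("Peron*", "Peroni, Silvio", whole=True))   # True
```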
def do_coauthor_graph(data, aut_id, level)
data: the data returned by process_metadata
aut_id: an identifier of an author
level: a positive integer which says how many levels of the network we have to consider
It returns a graph having a central node aut_id and a series of additional connected nodes for each other person who co-authored a paper with aut_id (level 1), and another series of additional connected nodes referring to the co-authors of all the previously added ones (level 2), and so on
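The level-by-level expansion described above is a breadth-first visit limited to a given depth. The following sketch assumes a hypothetical pre-computed mapping coauthors (author identifier to the set of its co-authors' identifiers) and returns plain sets of nodes and edges rather than any specific graph library object.

```python
def coauthor_graph(coauthors, aut_id, level):
    """Return (nodes, edges) reachable from aut_id within `level` steps."""
    nodes = {aut_id}
    edges = set()
    frontier = {aut_id}
    for _ in range(level):
        next_frontier = set()
        for author in frontier:
            for other in coauthors.get(author, ()):
                # frozenset makes the edge undirected: {a, b} == {b, a}
                edges.add(frozenset((author, other)))
                if other not in nodes:
                    nodes.add(other)
                    next_frontier.add(other)
        frontier = next_frontier
    return nodes, edges


coauthors = {
    "a:1": {"a:2"},
    "a:2": {"a:1", "a:3"},
    "a:3": {"a:2", "a:4"},
    "a:4": {"a:3"},
}
nodes, edges = coauthor_graph(coauthors, "a:1", 2)
print(sorted(nodes))  # ['a:1', 'a:2', 'a:3']
```

Keeping track of the already visited nodes prevents the visit from walking back towards the central author at every level.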
def do_author_network(data)
data: the data returned by process_metadata
It returns the graph of all the authors included in data. Two authors are connected in the graph if they have co-authored at least one paper together
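A minimal sketch of the whole-network construction follows, assuming a hypothetical input papers in which each item is the list of author identifiers of one article. The result is an adjacency mapping; converting it to the graph object you prefer is left out.

```python
from itertools import combinations


def author_network(papers):
    """Return {author: set of co-authors} from an iterable of author-id lists."""
    network = {}
    for authors in papers:
        unique = set(authors)
        # Authors with no co-authors still appear as isolated nodes.
        for author in unique:
            network.setdefault(author, set())
        # Every pair of authors of the same paper gets an edge.
        for first, second in combinations(unique, 2):
            network[first].add(second)
            network[second].add(first)
    return network


print(author_network([["a:1", "a:2"], ["a:1"], ["a:3"]]))
```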
def do_retrieve_tree_of_venues(data, no_ids)
data: the data returned by process_metadata
no_ids: the set of venue identifiers not to include in the result
It returns the root node (named "venues") of a tree. The children of the root node represent all the venues in data, with no repetitions. Each venue has as many children as it has distinct volumes. Similarly, each volume has as many children as it has distinct issues. Please use the value of the fields venue (with no identifiers), volume, and issue as the names of the nodes of the tree. If the no_ids parameter is None, all the venues in data are considered. Otherwise, the venues listed in no_ids are excluded from the result
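The venue, volume, issue hierarchy can be sketched with nested dictionaries. The row layout used here (keys venue, venue_ids, volume, issue) is a hypothetical pre-processed form, and the nested-dict result stands in for whatever tree structure you choose to return.

```python
def tree_of_venues(rows, no_ids=None):
    """Build {"venues": {venue: {volume: {issues}}}}, skipping excluded venues."""
    excluded = set(no_ids or ())
    tree = {"venues": {}}
    for row in rows:
        # Skip the row if any identifier of its venue is in the exclusion set.
        if excluded & set(row["venue_ids"]):
            continue
        volumes = tree["venues"].setdefault(row["venue"], {})
        if row["volume"]:
            issues = volumes.setdefault(row["volume"], set())
            if row["issue"]:
                issues.add(row["issue"])
    return tree


rows = [
    {"venue": "ISWC 2018", "venue_ids": ["b:2", "b:5"], "volume": "", "issue": ""},
    {"venue": "Data Science", "venue_ids": ["b:4"], "volume": "1", "issue": "1-2"},
]
print(tree_of_venues(rows))
print(tree_of_venues(rows, {"b:4"}))
```

Using dictionaries and sets gives the "no repetitions" requirement for free, since duplicate venues, volumes, and issues collapse into a single entry.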
Use test-driven development to check that your functions return the expected results
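A test for process_metadata might look like the following sketch. The process_metadata body here is only a stand-in (one dictionary per row) so that the example runs on its own; the sample file uses a comma delimiter and a subset of the columns, both of which are assumptions made for brevity.

```python
import csv
import os
import tempfile
import unittest


def process_metadata(metadata_file_path):
    """Stand-in implementation: one dictionary per CSV row."""
    with open(metadata_file_path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


class ProcessMetadataTest(unittest.TestCase):
    def setUp(self):
        # Write a tiny CSV file that the test can load.
        handle, self.path = tempfile.mkstemp(suffix=".csv")
        with os.fdopen(handle, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(["id", "title", "pub_date"])
            writer.writerow(["b:3", "Automating semantic publishing", "2017"])

    def tearDown(self):
        os.remove(self.path)

    def test_rows_are_loaded(self):
        data = process_metadata(self.path)
        self.assertEqual(len(data), 1)
        self.assertEqual(data[0]["title"], "Automating semantic publishing")
```

Saved in a file, the test can be run with python -m unittest followed by the file name. Writing the test first, and only then the function, makes it clear when the function is finished.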
The project, i.e. the implementation of the seven functions plus any ancillary functions you develop, must be contained in a single file named after your group, e.g. best_group_ever.py
The file must be sent by email to me at silvio.peroni@unibo.it
Submission: 2 days before the exam session (the whole group must attend the session), e.g. send the project on the 23rd of January to discuss it in the session of the 25th of January
Your project will be compared with the others in terms of efficiency (i.e. time needed for addressing specific tasks)
Maximum score = c-score + e-score + o-score = 16
All projects will run on large CSV files
Correctness of the result: c-score <= 4
Efficiency of the software: e-score <= 4; projects ranked according to the time spent for addressing various tasks
e-score = 4, 3, 2, or 1, depending on the project's position in the ranking
Oral colloquium: -8 <= o-score <= 8; it is a personal score, so each member of the group gets their own
At least 2 is needed to pass the exam; otherwise, additional work is required (a new function)