Project

Communication 1

The aperitif will be on the 16th of December, 19:00, at "Kinotto", Via Sebastiano Serlio, 25/2 (Google Maps, Open Street Maps)

Bus: number "21", stop "Dopolavoro Ferroviario" – itinerary, timetable

Meta Analyser Tool

The name of the project is Meta Analyser Tool

It is a piece of software that takes as input a file in a specific format (CSV) describing the metadata of existing scholarly articles.

The goal of the software is to run specific analyses on these data.

You need a group

The project must be implemented by a group of people

You need to

  • form groups of at least 3 and at most 4 people

  • choose a name for the group (yes: a name) - it will be used to publish the ranks of the best-performing projects (more info later)

  • communicate the name of the group and its members (including their emails) to me by sending an email to silvio.peroni@unibo.it

All groups must be ready by next Monday at the latest (16 December)

Project stub

from <group_name> import *

class MetaAnalyserTool(object):
    def __init__(self, metadata_file_path):
        self.data = process_metadata(metadata_file_path)

    def get_ids(self, str_value, field_set):
        return do_get_ids(self.data, str_value, field_set)

    def get_by_id(self, id, field_set):
        return do_get_by_id(self.data, id, field_set)

    def filter(self, field_value_list):
        return do_filter(self.data, field_value_list)

    def coauthor_graph(self, author_id, level):
        return do_coauthor_graph(self.data, author_id, level)

    def author_network(self):
        return do_author_network(self.data)

    def retrieve_tree_of_venues(self, no_ids):
        return do_retrieve_tree_of_venues(self.data, no_ids)

Your output

You have to develop a Python file named after your group, with spaces replaced by underscores and the whole name in lowercase

E.g.: group Best group ever, file best_group_ever.py

The import statement must specify the module implemented

E.g.: from best_group_ever import *
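The naming rule above can be sketched in a couple of lines. The helper name module_file_name is hypothetical and only serves to illustrate the rule; it is not part of the project stub.

```python
def module_file_name(group_name):
    """Return the file name expected for a group (hypothetical helper)."""
    # Lowercase the whole name and replace every space with an underscore.
    return group_name.lower().replace(" ", "_") + ".py"


print(module_file_name("Best group ever"))  # best_group_ever.py
```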

Each group has to implement the seven functions called in the project stub: process_metadata, do_get_ids, do_get_by_id, do_filter, do_coauthor_graph, do_author_network, and do_retrieve_tree_of_venues

process_metadata

def process_metadata(metadata_file_path)

It takes as input a comma-separated values (CSV) file and returns a data structure containing, in some form, all the data included in the CSV

The data can be preprocessed, changed according to some empirical rule, ordered in a certain way, etc.

These data will be automatically provided in input to the other functions
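A minimal sketch of one possible process_metadata, assuming the data structure is simply a list of dictionaries (one per row). This choice is an assumption made for illustration: any structure is allowed, and a more elaborate one may be faster for the other functions. The temporary file and its contents are made up for the demo.

```python
import csv
import os
import tempfile


def process_metadata(metadata_file_path):
    # One possible structure: a list of dictionaries, one per CSV row,
    # keyed by the column names found in the header line.
    with open(metadata_file_path, encoding="utf-8", newline="") as f:
        return list(csv.DictReader(f))


# Demo on a tiny temporary CSV file (contents invented for the example).
sample = "id,title,pub_date\nb:3,Automating semantic publishing,2017\n"
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", delete=False, encoding="utf-8", newline=""
) as tmp:
    tmp.write(sample)

data = process_metadata(tmp.name)
os.remove(tmp.name)
print(data[0]["title"])  # Automating semantic publishing
```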

Example of CSV metadata

id | title | author | pub_date | venue | volume | issue | page | type | publisher
id:1; b:1 | The SPAR Ontologies | Peroni, Silvio [a:1; viaf:309649450]; Shotton, David [a:2] | 2018 | ISWC 2018 [b:2; b:5] | | | 119-136 | book chapter | Springer [p:1]
b:3 | Automating semantic publishing | Peroni, Silvio [a:1] | 2017 | Data Science [b:4] | 1 | 1-2 | 155-173 | journal article | IOS Press [p:2]

id (list of strings, separator ";"): the identifiers of an article

author (list of strings, separator ";"): the authors of an article

[id1; id2; etc.] (list of strings, separator ";"): each string of the fields author, venue and publisher can include one or more identifiers for that entity
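Note that ";" separates both the entities in a field and the identifiers inside the square brackets, so a plain split on ";" is not enough. The sketch below shows one way to handle this with regular expressions; parse_entities and the two compiled patterns are hypothetical names, and the sketch assumes that brackets are never nested.

```python
import re

# Split on ";" only when it is NOT inside square brackets.
_OUTER_SEPARATOR = re.compile(r";\s*(?![^\[]*\])")
# A name, optionally followed by "[id1; id2; ...]".
_ENTITY = re.compile(r"(.*?)\s*(?:\[(.*)\])?")


def parse_entities(field_value):
    """Return a list of (name, [identifiers]) pairs for one field value."""
    entities = []
    for part in _OUTER_SEPARATOR.split(field_value):
        part = part.strip()
        if not part:
            continue  # skip empty chunks, e.g. for an empty field
        match = _ENTITY.fullmatch(part)
        name, raw_ids = match.group(1), match.group(2)
        ids = [i.strip() for i in (raw_ids or "").split(";") if i.strip()]
        entities.append((name, ids))
    return entities


print(parse_entities("ISWC 2018 [b:2; b:5]"))
# [('ISWC 2018', ['b:2', 'b:5'])]
```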

do_get_ids

def do_get_ids(data, str_value, field_set)

data: the data returned by process_metadata

str_value: a string value to search

field_set: the set of the field names to consider

It returns the set of identifiers of all the items that match str_value. If the field_set parameter is None, all the fields are searched. Otherwise, only the fields listed in field_set are searched

do_get_by_id

def do_get_by_id(data, id, field_set)

data: the data returned by process_metadata

id: an identifier of an item

field_set: the set of the field names to consider

It returns a set of strings for the items associated with the identifier id. If the field_set parameter is None, all the fields are searched for the identifier. Otherwise, only the fields listed in field_set are used.

do_filter

def do_filter(data, field_value_list)

data: the data returned by process_metadata

field_value_list: a list of tuples of two values each

It returns a list with the textual representation of each row in the CSV table that complies with the filters specified in field_value_list. Each tuple of the field_value_list parameter contains two values: the first is the field on which to apply the filter, the second is the value used for filtering - e.g. ("author", "Peron*") will take into consideration only the rows that, in the field author, have at least one author whose name starts with "Peron". If field_value_list is None, the function considers all the rows in data

Textual representation: <Family Name 1> <Given Name Initials 1>, [...], <Family Name N> <Given Name Initials N>. (<pub_date>). <title>. <venue>

E.g.: Peroni S, Shotton D. (2018). The SPAR Ontologies. ISWC 2018
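The textual representation can be sketched as follows. The function names are hypothetical, and the sketch assumes that author names come in the form "Family, Given" with identifiers already removed, and that the initials are the first letters of the space-separated given names.

```python
def format_author(name):
    # "Peroni, Silvio" -> "Peroni S": family name plus given-name initials.
    family, _, given = name.partition(",")
    initials = "".join(part[0].upper() for part in given.split())
    return (family.strip() + " " + initials).strip()


def textual_representation(authors, pub_date, title, venue):
    # <authors>. (<pub_date>). <title>. <venue>
    names = ", ".join(format_author(a) for a in authors)
    return "{}. ({}). {}. {}".format(names, pub_date, title, venue)


print(textual_representation(
    ["Peroni, Silvio", "Shotton, David"], "2018",
    "The SPAR Ontologies", "ISWC 2018"))
# Peroni S, Shotton D. (2018). The SPAR Ontologies. ISWC 2018
```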

Wild cards and rules

Multiple wildcards * can be used in str_value, id, and in the second value of each tuple in field_value_list of the previous functions. E.g. World*Web looks for all the strings that match the word World followed by zero or more characters, followed by the word Web (examples: World Wide Web, World Spider Web, etc.)

All matches are case insensitive – e.g. specifying World as str_value will also match strings that contain world
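One way to implement these rules is to translate the wildcard pattern into a regular expression. The sketch below is only an illustration: the function names are hypothetical, and it assumes that a pattern may match anywhere inside a string (as in the World/world example); anchoring the pattern at the start of the string is a design choice left to each group.

```python
import re


def wildcard_to_regex(pattern):
    # Escape every literal chunk so that characters such as "." or "("
    # are matched literally, then put ".*" where each "*" wildcard was.
    chunks = [re.escape(chunk) for chunk in pattern.split("*")]
    return re.compile(".*".join(chunks), re.IGNORECASE)


def matches(pattern, text):
    # Assumption: a match anywhere in the text counts (substring search).
    return wildcard_to_regex(pattern).search(text) is not None


print(matches("World*Web", "World Wide Web"))  # True
print(matches("World", "hello world"))         # True
```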

do_coauthor_graph

def do_coauthor_graph(data, aut_id, level)

data: the data returned by process_metadata

aut_id: an identifier of an author

level: a positive integer which says how many levels of the network we have to consider

It returns a graph having a central node aut_id and a series of additional connected nodes, one for each other person who co-authored a paper with aut_id (level 1), then another series of additional connected nodes for the co-authors of all the previously added ones (level 2), and so on
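The level-by-level expansion is a breadth-first visit limited to a given depth. The sketch below assumes a hypothetical neighbours dictionary mapping each author identifier to the set of its co-authors, and it returns plain sets of nodes and edges as a stand-in for whatever graph class is actually used.

```python
def coauthor_graph(neighbours, aut_id, level):
    nodes = {aut_id}
    edges = set()
    frontier = {aut_id}  # authors added in the previous level
    for _ in range(level):
        next_frontier = set()
        for author in frontier:
            for other in neighbours.get(author, ()):
                # frozenset makes the edge undirected: (a, b) == (b, a).
                edges.add(frozenset((author, other)))
                if other not in nodes:
                    nodes.add(other)
                    next_frontier.add(other)
        frontier = next_frontier
    return nodes, edges


# Invented co-authorship chain: a:1 - a:2 - a:3 - a:4
neighbours = {
    "a:1": {"a:2"},
    "a:2": {"a:1", "a:3"},
    "a:3": {"a:2", "a:4"},
    "a:4": {"a:3"},
}
nodes, edges = coauthor_graph(neighbours, "a:1", 2)
print(sorted(nodes))  # ['a:1', 'a:2', 'a:3']
```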

do_author_network

def do_author_network(data)

data: the data returned by process_metadata

It returns the graph of all the authors included in data. Two authors are connected in the graph if they have co-authored at least one paper together
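A minimal sketch of the idea, assuming a hypothetical input where each paper is given as a list of author identifiers, and using a dictionary of sets as a stand-in for the graph. Authors with no co-authors are kept as isolated nodes.

```python
from itertools import combinations


def author_network(papers):
    graph = {}
    for authors in papers:
        unique = set(authors)
        # Make sure every author is a node, even without co-authors.
        for author in unique:
            graph.setdefault(author, set())
        # Connect every pair of authors of the same paper.
        for a, b in combinations(unique, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


network = author_network([["a:1", "a:2"], ["a:1"], ["a:3"]])
print(sorted(network["a:1"]))  # ['a:2']
```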

do_retrieve_tree_of_venues

def do_retrieve_tree_of_venues(data, no_ids)

data: the data returned by process_metadata

no_ids: the set of venue identifiers not to include in the result

It returns the root node (named "venues") of a tree. The children of the root node represent all the venues in data with no repetitions. Each venue has as many children as it has distinct volumes. Similarly, each volume has as many children as it has distinct issues. Please use the values of the fields venue (with no identifiers), volume, and issue as the names of the nodes of the tree. If the no_ids parameter is None, it will consider all the venues in data. Otherwise, the venues listed in no_ids will be excluded from the result
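The shape of the tree can be sketched with nested dictionaries, used here as a stand-in for a real tree class. The sketch assumes hypothetical rows where the venue name and its identifiers have already been separated (keys venue and venue_ids), and it skips empty volume and issue values.

```python
def tree_of_venues(rows, no_ids=None):
    excluded = set(no_ids or ())
    root = {"venues": {}}
    for row in rows:
        # Skip the venue if any of its identifiers is in no_ids.
        if excluded & set(row["venue_ids"]):
            continue
        # setdefault avoids repetitions: an existing node is reused.
        venue_node = root["venues"].setdefault(row["venue"], {})
        if row["volume"]:
            volume_node = venue_node.setdefault(row["volume"], {})
            if row["issue"]:
                volume_node.setdefault(row["issue"], {})
    return root


rows = [
    {"venue": "ISWC 2018", "venue_ids": ["b:2", "b:5"], "volume": "", "issue": ""},
    {"venue": "Data Science", "venue_ids": ["b:4"], "volume": "1", "issue": "1-2"},
]
print(tree_of_venues(rows, {"b:4"}))  # {'venues': {'ISWC 2018': {}}}
```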

A suggestion

Use test-driven development to check that your code is doing the right job
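As a small illustration of the approach: write the test first, stating what the function must return, and then write the function until the test passes. The helper split_ids is hypothetical and is used here only as an example of an ancillary function.

```python
def test_split_ids():
    # Written first: these are the results we expect.
    assert split_ids("id:1; b:1") == ["id:1", "b:1"]
    assert split_ids("b:3") == ["b:3"]
    assert split_ids("") == []


def split_ids(value):
    # Written afterwards, and refined until the test passes.
    return [part.strip() for part in value.split(";") if part.strip()]


test_split_ids()
print("all tests passed")
```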

Submission

The project, i.e. the implementation of the seven functions plus any ancillary function you develop, must be included in a single file named after your group, e.g. best_group_ever.py

The file must be sent by email to me at silvio.peroni@unibo.it

Submission: 2 days before the exam session (the whole group must attend the session) – e.g. send the project on the 23rd of January to discuss it in the session of the 25th of January

Your project will be compared with the others in terms of efficiency (i.e. time needed for addressing specific tasks)
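To get a rough idea of the time your functions need, you can time them yourself. The sketch below is only an assumption about how to measure: the helper time_call is hypothetical, and the actual procedure used for the ranking is not described here.

```python
import time


def time_call(func, *args, repeat=3):
    """Call func(*args) several times; return its result and the best time."""
    best = float("inf")
    result = None
    for _ in range(repeat):
        start = time.perf_counter()
        result = func(*args)
        best = min(best, time.perf_counter() - start)
    return result, best


result, seconds = time_call(sorted, [3, 1, 2])
print(result)  # [1, 2, 3]
```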

Evaluation

Maximum score = c-score + e-score + o-score = 16

All projects will run on large CSV files

Correctness of the result: c-score <= 4

Efficiency of the software: e-score <= 4; projects ranked according to the time spent for addressing various tasks

  1. e-score = 4

  2. e-score = 3

  3. e-score = 2

  4. e-score = 1

Oral colloquium: -8 <= o-score <= 8; it is a personal score, each member of the group has their own

At least 2 for passing the exam, otherwise an additional effort is required (new function)

END Project