Co-stardom network 🎬

5 minute read

Introduction

After a movie, my first question always is: “Are the main actors really famous?” and sometimes “Did this pair play in other movies?”. To answer these questions, I created an interactive co-stardom network.

The project is live, have a try!

Description

The graph is interactive. Two actors are connected in the network if they both played in at least one movie.

A screenshot of what the network looks like. Two actors were added: Emma Stone and Brad Pitt. We can visually identify the few actors who played with both of them. The view is filtered on Ryan Gosling. As he is selected, as well as his relationships with the other two, we get more insight in the bottom right panel.

Adding actors

Either search for your favorite actor in the appropriate input field and click Add, or hit to add a random actor.

Selecting actors/relationships

You can click and drag nodes and edges anywhere. A selected node will appear in red and a selected edge in black. Selecting a node reveals the number of connections the actor currently has in the graph, as well as basic personal information on the actor/actress. Selecting an edge reveals the common movies these two actors have. You can select simultaneously multiple nodes and edges with Ctrl/Cmd+click or with a click and drag rectangle box selection while holding Ctrl/Cmd.

Removing actors

Hit to remove everbody from the graph.

Hit to remove the selected actors (those in red).

Hit to clear actors with no relationships in the network.

Alternatively, you can type the actor’s name in the appropriate input field and hit Remove.

Instead of removing an actor, you can filter the view more easily with the automatic filtering text input .

Data source

One major concern is to get a reliable source of data. IMDb provides non-commercial datasets for everyone to use on the form of huge tsv files. For our purpose, we will look at three of these files : title_basics.tsv, name_basics.tsv, title_principals.tsv. These contain respectively : the list of all movies, the list of all actors and the list of casts, i.e. for every impersonation of one actor in a single movie the list of all characters.

The following tables show the first few lines of these files.

	tconst	titleType	primaryTitle	originalTitle	startYear	endYear	runtimeMinutes	genres
0	tt0000001	short	Carmencita	Carmencita	1894	nan	1	Documentary,Short
1	tt0000002	short	Le clown et ses chiens	Le clown et ses chiens	1892	nan	5	Animation,Short
2	tt0000003	short	Pauvre Pierrot	Pauvre Pierrot	1892	nan	4	Animation,Comedy,Romance
3	tt0000004	short	Un bon bock	Un bon bock	1892	nan	12	Animation,Short
9	tt0000010	short	Leaving the Factory	La sortie de l’usine Lumière à Lyon	1895	nan	1	Documentary,Short

nconst	primaryName	birthYear	deathYear	primaryProfession
nm0000001	Fred Astaire	1899	1987	soundtrack,actor,miscellaneous
nm0000002	Lauren Bacall	1924	2014	actress,soundtrack
nm0000003	Brigitte Bardot	1934	nan	actress,soundtrack,music_department
nm0000004	John Belushi	1949	1982	actor,soundtrack,writer
nm0000005	Ingmar Bergman	1918	2007	writer,director,actor
nm0000006	Ingrid Bergman	1915	1982	actress,soundtrack,producer

	tconst	nconst	category
0	tt0000001	nm1588970	self
1	tt0000001	nm0005690	director
2	tt0000001	nm0374658	cinematographer
3	tt0000002	nm0721526	director
4	tt0000002	nm1335271	composer
5	tt0000003	nm0721526	director

Actors are uniquely identified by the nconst id, while movies are identified with the tconst id. The first two tables provide basic information, while the third one links the two identifiers.

Design choices

Database management system

My first thought was to migrate these files to a SQL database because the data is well structured. Unfortunately, I was not able to create a PostgreSQL database on my machine 💀. This is one of the many issues encountered in real life projects.

So I switched to a graph-based database called Neo4j. The advantage of a graph-based database is that the queries would then be written in an almost natural language. The installation works fine and I was able to define a relational model for the three files to describe the structure of the network. However, the free plan of Neo4j is limited to 200000 nodes, which allows only for a very partial upload of the dataset. Because in this project the completeness of the dataset really matters, I switched to MongoDB for its flexibility and ease of use, which allows to store the full dataset in a document-based database.

Going back to a non-graph-based database is not that much of an issue in our case as I only need one single request: “Find all the other actors a single actor ever played with”. If I were to ask for more complex queries, such as “Find all the other actors at a distance lower than two of a given actor” or “What is the distance (minimum number of connections in the network) between two actors?”, then I would need a graph-based database, as the requests would be absolutely inefficient (and tedious to write) in MongoDB.

User interface

Displaying the whole network seemed a bit daunting at first. With hundreds of thousands of actors, it would be an intricate mesh of nodes and edges and require tremendous computation power on both client-side on server-side. Moreover, we usually are interested in a small portion of the network, so I let the user add the desired actors on its own.

To see the code, head to github.

Twitter Facebook LinkedIn

Enguerrand Monard