AssistQ Dataset
In each data folder, there are several files:
(1) video.mp4 / video.mov: instructional video;
(2) script.txt: the video script with timestamps.
0:00:00-0:00:04 How to start, stop, start and stop airfryer? Turn the temperature knob anticlockwise to 120 degrees.
0:00:04-0:00:07 Turn the time knob clockwise to 10 minutes.
...
The meaning of the annotation (from left to right): start time-end time, followed by the script text. Times are written as H:MM:SS, as in the example lines above.
(3) buttons.csv: button bounding-box annotation.
button1,362,86,185,72,airfryer-user.jpg,960,1280
button2,378,330,185,170,airfryer-user.jpg,960,1280
...
The meaning of the annotation (from left to right): button name, top-left x, top-left y, width, height, image filename, image width, image height.
(4) images/ folder: contains the image files referenced in buttons.csv.
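The two annotation formats above can be read with a few lines of Python. This is a minimal sketch, not part of the dataset tooling: parse_script_line and parse_buttons are hypothetical helper names, and the code assumes that every script.txt line follows the "H:MM:SS-H:MM:SS text" layout of the excerpt and that all numeric fields in buttons.csv are integers, as they are in the sample rows.

```python
import csv
import io
import re

# Assumed layout, taken from the script.txt excerpt: "H:MM:SS-H:MM:SS text".
SCRIPT_LINE = re.compile(
    r"^(\d+):(\d{2}):(\d{2})-(\d+):(\d{2}):(\d{2})\s+(.*)$"
)


def parse_script_line(line):
    """Return (start_seconds, end_seconds, text) for one script.txt line."""
    match = SCRIPT_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized script line: {line!r}")
    h1, m1, s1, h2, m2, s2 = (int(g) for g in match.groups()[:6])
    start = h1 * 3600 + m1 * 60 + s1
    end = h2 * 3600 + m2 * 60 + s2
    return start, end, match.group(7)


def parse_buttons(csv_text):
    """Map each button name to its box corners and image information.

    Column order follows the description above: name, top-left x,
    top-left y, width, height, image filename, image width, image height.
    Integer values are an assumption based on the sample rows.
    """
    buttons = {}
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # skip blank lines
        name, x, y, w, h, image, img_w, img_h = row
        x, y, w, h = int(x), int(y), int(w), int(h)
        buttons[name] = {
            "box": (x, y, x + w, y + h),  # (x_min, y_min, x_max, y_max)
            "image": image,
            "image_size": (int(img_w), int(img_h)),
        }
    return buttons
```

Converting the box to corner coordinates is only a convenience for cropping or drawing; the file itself stores the top-left corner plus width and height.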
Question-Answer Annotations: we aggregate the annotations of all data samples in train.json. A video can have multiple questions, and each question needs to be answered in multiple steps and in multiple modalities. Specifically, each data index (e.g., coffeemachine_d2stw, diffuser_lxcd4) corresponds to a list that contains multiple question-answer pairs:
{'aircon_utr3b': [{...}, {...}, {...}, {...}, {...}, {...}], 'airfryer_gye82': [{...}, {...}, {...}, {...}, {...}], 'airfryer_pe2j7': [{...}, {...}, {...}, {...}, {...}, {...}, {...}, {...}, {...}], 'airfryer_w9rzm': [{...}, {...}, {...}, {...}, {...}, {...}, {...}, {...}, {...}, ...], 'bicycle_g8h94': [{...}, {...}, {...}], ...}
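A minimal sketch of reading this mapping, assuming train.json is standard JSON at a path you supply. load_annotations, questions_per_sample and device_of are hypothetical helper names, and the "<device>_<id>" reading of the keys is inferred from the examples above rather than stated by the dataset.

```python
import json


def load_annotations(path):
    """Load the aggregated annotations: data index -> list of QA pairs."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def questions_per_sample(annotations):
    """Count the question-answer pairs attached to each data index."""
    return {index: len(qa_list) for index, qa_list in annotations.items()}


def device_of(index):
    """Return the part of a data index before the last underscore.

    Assumption: indices look like "<device>_<id>", e.g. "aircon_utr3b".
    """
    return index.rsplit("_", 1)[0]
```

For example, questions_per_sample(load_annotations("train.json")) gives the number of questions for every video.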
The list for a single data sample has the following form:
[
{
"question": "How to bake a cake at 120 degrees for 15 minutes?",
"answers": [
["Turn <button1> clockwise", "Turn <button1> anticlockwise", "Turn <button2> clockwise", "Turn <button2> anticlockwise to 0 minutes", "Turn <button1> to 200 degrees", "Turn <button1> to 120 degrees", "Turn <button1> to 180 degrees", "Turn <button2> clockwise to 3 minutes", "Turn <button2> clockwise to 10 minutes", "Turn <button2> clockwise to 15 minutes"],
["Turn <button1> clockwise", "Turn <button1> anticlockwise", "Turn <button2> clockwise", "Turn <button2> anticlockwise to 0 minutes", "Turn <button1> to 200 degrees", "Turn <button1> to 120 degrees", "Turn <button1> to 180 degrees", "Turn <button2> clockwise to 3 minutes", "Turn <button2> clockwise to 10 minutes", "Turn <button2> clockwise to 15 minutes"]
], # candidate answers for each step (2 steps in this case)
"correct": [6, 10], # index of the correct answer for each step (starting from 1); not released for the test set
"images": ["airfryer-user.jpg", "airfryer-user.jpg"] # user-view image for each step, as referenced in buttons.csv
},
{...},
...
]
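Because the "correct" indices start from 1, they need an offset before being used with Python lists. The sketch below shows one way to recover the answer text for each step; correct_answer_texts and referenced_buttons are hypothetical helper names, and the "<buttonN>" pattern is assumed from the example answers above.

```python
import re


def correct_answer_texts(qa):
    """Return the correct answer text for each step of one QA pair.

    Requires the "correct" field, so this applies to training data only.
    """
    texts = []
    for candidates, index in zip(qa["answers"], qa["correct"]):
        if not 1 <= index <= len(candidates):
            raise IndexError(f"answer index {index} is out of range")
        texts.append(candidates[index - 1])  # convert 1-based to 0-based
    return texts


def referenced_buttons(answer_text):
    """List the button names mentioned in an answer, e.g. ["button1"]."""
    return re.findall(r"<(button\d+)>", answer_text)
```

Applied to the example above, "correct": [6, 10] selects "Turn <button1> to 120 degrees" for the first step and "Turn <button2> clockwise to 15 minutes" for the second. The button names can then be looked up in buttons.csv for the image listed in "images" at the same step.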